

Python Web Scraping in Practice: Crawling Douban Images with Scrapy

This post uses Scrapy to crawl all of the personal photos of a given film star on Douban.

We use Monica Bellucci as the example.


1. First, open a terminal in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure looks like this:

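This is the standard layout that scrapy startproject generates for a project named banciyuan (the exact contents may vary slightly between Scrapy versions):

banciyuan/
    scrapy.cfg            # deploy/config file; marks the project root
    banciyuan/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py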

2. To make it convenient to run the Scrapy project from PyCharm, create a new main.py:

from scrapy import cmdline

# Equivalent to running "scrapy crawl banciyuan" on the command line
cmdline.execute('scrapy crawl banciyuan'.split())
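Note that main.py has to run from the project root (the directory containing scrapy.cfg), since cmdline.execute locates the project settings the same way the scrapy command does, by searching upward from the working directory.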

Then open Run > Edit Configurations in PyCharm.

[Screenshot: Edit Configurations menu in PyCharm]

Then configure it as shown below (typically the script path points at main.py and the working directory at the project root); after that, you can run the Scrapy project simply by running main.py.

[Screenshot: run configuration for main.py]

3. Analyze the HTML of the photos page and create the corresponding spider.

[Screenshot: HTML structure of the Douban photos page]

import scrapy
from scrapy import Spider

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/celebrity/1025156/photos/']
    url = 'https://movie.douban.com/celebrity/1025156/photos/'

    def parse(self, response):
        # The last link in the paginator holds the total number of pages
        num = response.xpath("//div[@class='paginator']/a[last()]/text()").extract_first('')
        print(num)
        for i in range(int(num)):
            # Each listing page holds 30 photos; build the query string for page i
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the links to each photo's detail page on the listing page
        href_list = response.xpath("//div[@class='article']//div[@class='cover']/a/@href").extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # On the detail page, grab the full-size image URL and the page title
        src = response.xpath(
            "//div[@class='article']//div[@class='photo-show']//div[@class='photo-wp']/a[1]/img/@src"
        ).extract_first('')
        title = response.xpath("//div[@id='content']/h1/text()").extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
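Before committing XPath expressions to the spider, it can be worth checking them interactively. A quick sketch using scrapy shell (the URL is the spider's start URL; Douban may refuse requests without a browser-like User-Agent, which can be supplied with the -s USER_AGENT=... option):

scrapy shell https://movie.douban.com/celebrity/1025156/photos/
>>> # total number of pages, taken from the last paginator link
>>> response.xpath("//div[@class='paginator']/a[last()]/text()").extract_first('')
>>> # links to the individual photo pages
>>> response.xpath("//div[@class='article']//div[@class='cover']/a/@href").extract()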

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()
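One detail worth noting: the spider stores src as a list (item['src'] = [src]) because the pipeline below indexes it with item['src'][0]. The stock ImagesPipeline would instead expect the field to be named image_urls, but since this project overrides get_media_requests, any field name works.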

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image URL, passing the item along in meta
        # so that file_path can read its fields later
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        # Use the last segment of the URL as the file name
        image_name = item['src'][0].split('/')[-1]
        # image_name = image_name.replace('.webp', '.jpg')
        # Group images in a folder named after the first word of the title
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
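A couple of notes on this pipeline: ImagesPipeline requires the Pillow library to be installed, and the path returned by file_path is taken relative to the IMAGES_STORE setting shown below, so with this post's settings an image ends up at ./images/<first word of the title>/<file name>. The commented-out .webp rename also needs to reassign the result (image_name = image_name.replace(...)), since str.replace returns a new string rather than modifying it in place.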

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent.
# USER_AGENT must be a plain string, not a headers dict.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'banciyuan.pipelines.BanciyuanPipeline': 1,
}
# Root directory where the images pipeline stores downloaded files
IMAGES_STORE = './images'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
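With the pipeline registered under ITEM_PIPELINES and IMAGES_STORE pointing at ./images, the crawl can be started either through main.py in PyCharm or directly from a terminal in the project root:

scrapy crawl banciyuan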

5. Crawl results

[Screenshot: crawl results, downloaded images under the images directory]

Reference

Source code

This concludes the article on using Scrapy to crawl Douban images. For more on crawling Douban images with Scrapy, search 好吧啦網(wǎng)'s earlier articles, and please keep supporting 好吧啦網(wǎng)!

Tags: Douban, Python, programming languages