
Example Code: Scraping Xici Proxy (西刺代理) with Multithreaded Python

Views: 63 | Date: 2022-06-28 16:59:26

Xici Proxy (西刺代理) was a free IP proxy listing site in China. The service has since shut down, so I am publishing my original code here for anyone who wants to learn from it.

Mirror of the page: https://www.blib.cn/url/xcdl.html

First, find all the tr tags in the page, including the ones with class='odd', and pull them out.

Next, go through each tr tag and find all the td tags inside it. Only the td tags at positions [1,2,5,9] are extracted; the rest are ignored.
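The row and cell selection described above can be tried out offline. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so it runs without third-party packages, and SAMPLE_HTML, RowCollector and extract_rows are hypothetical names with made-up sample data, not the real page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the layout of the real page is assumed here.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>IP</th><th>Port</th></tr>
  <tr class="odd">
    <td>CN</td><td>119.101.112.31</td><td>9999</td><td>Hubei</td><td>High</td>
    <td>HTTP</td><td>0.2s</td><td>0.1s</td><td>1 day</td><td>20-01-01 12:00</td>
  </tr>
  <tr class="">
    <td>CN</td><td>119.101.113.32</td><td>8080</td><td>Hubei</td><td>High</td>
    <td>HTTPS</td><td>0.3s</td><td>0.1s</td><td>2 days</td><td>20-01-01 12:05</td>
  </tr>
</table>
"""

WANTED = [1, 2, 5, 9]  # td positions kept from each row


class RowCollector(HTMLParser):
    """Collects the wanted td texts from every tr that has a class attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the current data row, None outside one
        self._text = None   # text pieces of the current td, None outside one

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # The header row has no class attribute and is skipped.
            self._cells = [] if 'class' in dict(attrs) else None
        elif tag == 'td' and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._text is not None:
            self._cells.append(''.join(self._text).strip())
            self._text = None
        elif tag == 'tr' and self._cells is not None:
            if len(self._cells) > max(WANTED):
                self.rows.append([self._cells[i] for i in WANTED])
            self._cells = None


def extract_rows(html):
    parser = RowCollector()
    parser.feed(html)
    return parser.rows


if __name__ == '__main__':
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

Rows with too few cells are dropped rather than raising an IndexError, which is one way to survive a page whose layout differs from what is assumed.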


With that, we can write the code that extracts a single page and saves the extracted rows to a file.

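The scraped results are saved to a file named SpiderAddr.json. Despite the .json extension, each line of the file holds the string form of a Python list rather than JSON.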
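To make the file format concrete, here is a minimal sketch of writing one row per line and reading the rows back. The helper names save_row and load_rows and the lock are assumptions added for illustration; they are not part of the original code.

```python
import ast
import os
import tempfile
import threading

# Hypothetical helper lock: keeps lines from different threads from interleaving.
_write_lock = threading.Lock()


def save_row(path, row):
    """Append one row as the string form of a Python list, one row per line."""
    with _write_lock:
        with open(path, 'a', encoding='utf-8') as fp:
            fp.write(str(row) + '\n')


def load_rows(path):
    """Read the rows back; literal_eval accepts Python literals only."""
    rows = []
    with open(path, 'r', encoding='utf-8') as fp:
        for line in fp:
            line = line.strip()
            if line:
                rows.append(ast.literal_eval(line))
    return rows


if __name__ == '__main__':
    demo_path = os.path.join(tempfile.mkdtemp(), 'SpiderAddr.json')
    save_row(demo_path, ['119.101.112.31', '9999', 'HTTP', '20-01-01 12:00'])
    print(load_rows(demo_path))
```

Reading with ast.literal_eval rather than eval means a malformed or hostile line raises an error instead of running as code.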


Finally, a second piece of code converts the saved rows into a format that an SSR proxy tool can read directly, for example {'http': 'http://119.101.112.31:9999'}.

import ast

if __name__ == '__main__':
    result = []
    # Each line of the file is the string form of a Python list: [ip, port, protocol, ...]
    with open('SpiderAddr.json', 'r', encoding='utf-8') as fp:
        data = fp.readlines()
    for item in data:
        if not item.strip():
            continue
        dic = {}
        # literal_eval parses the list without executing arbitrary code, unlike eval
        read_line = ast.literal_eval(item.strip())
        Protocol = read_line[2].lower()
        if Protocol == 'http':
            dic[Protocol] = 'http://' + read_line[0] + ':' + read_line[1]
        else:
            dic[Protocol] = 'https://' + read_line[0] + ':' + read_line[1]
        result.append(dic)
    print(result)


The complete multithreaded version is shown below.

import ast, re, threading
import argparse
import requests
from queue import Queue, Empty
from bs4 import BeautifulSoup  # the 'lxml' parser used below needs the lxml package installed

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}
write_lock = threading.Lock()  # keeps lines written by different threads from interleaving

class AgentSpider(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        with open('SpiderAddr.json', 'a+', encoding='utf-8') as fp:
            while True:
                # get_nowait avoids the race between empty() and get()
                try:
                    url = self._queue.get_nowait()
                except Empty:
                    break
                try:
                    response = requests.get(url=url, headers=head, timeout=10)
                    soup = BeautifulSoup(response.content, 'lxml')
                    # The pattern '|[^odd]' has an empty alternative, so it matches any class value
                    data = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
                    for item in data:
                        proxy_list = item.find_all(name='td')
                        ip_list = [proxy_list[i].string for i in [1, 2, 5, 9]]
                        with write_lock:
                            fp.write(str(ip_list) + '\n')
                            fp.flush()
                        print('[+] Scraped row: {} saved'.format(ip_list))
                except Exception as exc:
                    # Report the failure instead of silently swallowing it
                    print('[-] Failed on {}: {}'.format(url, exc))

def StartThread(count):
    queue = Queue()
    threads = []
    for item in range(1, int(count) + 1):
        url = 'https://www.xicidaili.com/nn/{}'.format(item)
        queue.put(url)
        print('[+] Generated crawl link {}'.format(url))
    # One worker thread per page
    for item in range(int(count)):
        threads.append(AgentSpider(queue))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Conversion function
def ConversionAgentIP(FileName):
    result = []
    with open(FileName, 'r', encoding='utf-8') as fp:
        data = fp.readlines()
    for item in data:
        if not item.strip():
            continue
        dic = {}
        read_line = ast.literal_eval(item.strip())
        Protocol = read_line[2].lower()
        if Protocol == 'http':
            dic[Protocol] = 'http://' + read_line[0] + ':' + read_line[1]
        else:
            dic[Protocol] = 'https://' + read_line[0] + ':' + read_line[1]
        result.append(dic)
    return result

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--page', dest='page', help='number of pages to crawl')
    parser.add_argument('-f', '--file', dest='file', help='convert the scraped results into proxy format, e.g. SpiderAddr.json')
    args = parser.parse_args()
    if args.page:
        StartThread(int(args.page))
    elif args.file:
        dic = ConversionAgentIP(args.file)
        for item in dic:
            print(item)
    else:
        parser.print_help()
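Since the site is no longer online, the queue-and-worker pattern used by the full script can be exercised without any network access. The following is a minimal sketch under that assumption: run_workers and fake_fetch are hypothetical names, and fake_fetch merely stands in for the real requests.get call.

```python
import queue
import threading


def run_workers(urls, fetch, worker_count):
    """Drain a queue of URLs with worker_count threads; return {url: result}."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()  # no gap between "is it empty?" and "get"
            except queue.Empty:
                return
            try:
                value = fetch(url)
            except Exception as exc:
                value = exc  # record the failure instead of hiding it
            with lock:
                results[url] = value

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Stand-in for a real HTTP request; returns the URL length.
    return len(url)


if __name__ == '__main__':
    pages = ['https://www.xicidaili.com/nn/{}'.format(i) for i in range(1, 6)]
    for url, value in sorted(run_workers(pages, fake_fetch, 3).items()):
        print(url, value)
```

Passing the fetch function in as a parameter keeps the threading logic testable on its own, and the worker count no longer has to equal the page count.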
