
Scraping Weibo Data with Python and Selenium: A Code Example

Date: 2022-07-24 17:56:03

This example scrapes one user's Weibo posts, covering that user's entire posting history.

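The approach:

Create the driver -> load the page -> find and extract the data -> save it to CSV -> get the next page's URL -> load the page (the loop starts over) -> ... -> stop when there is no "next page" link.

The loop is written as a while True loop rather than as a function that calls itself.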
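The loop described above can be tried without a browser. The following is a minimal sketch, not the article's actual crawler: `FAKE_SITE`, `fetch_page`, and `crawl_all_pages` are hypothetical names introduced for illustration, and a dictionary stands in for the pages that Selenium would load.

```python
# Hypothetical stand-in for the site: each "URL" maps to the rows found on
# that page and the URL of the next page (None on the last page).
FAKE_SITE = {
    "page1": (["post A", "post B"], "page2"),
    "page2": (["post C"], "page3"),
    "page3": (["post D"], None),
}


def fetch_page(url):
    """Return (rows, next_url) for a page; a real crawler would drive the browser here."""
    return FAKE_SITE[url]


def crawl_all_pages(start_url):
    """Follow "next page" links with a while True loop until none is left."""
    collected = []
    url = start_url
    while True:
        rows, next_url = fetch_page(url)
        collected.extend(rows)
        if next_url is None:
            break  # no "next page" link: the crawl is finished
        url = next_url
    return collected


all_rows = crawl_all_pages("page1")
print(all_rows)
```

A loop keeps the call stack flat however many pages the account has, whereas a self-calling version adds one stack frame per page and could run into Python's recursion limit on a very long timeline.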

嘟大海's Weibo: https://weibo.com/u/1623915527

辦公室小野's Weibo: https://weibo.com/bgsxy

The code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv
import os
import time

# These are the only two settings. To scrape a different account,
# change the URL and the target CSV file name here.
weibo_url = 'https://weibo.com/bgsxy?profile_ftype=1&is_all=1#_0'
csv_name = 'bgsxy_allweibo.csv'


def start_chrome():
    print('Creating the browser')
    # Selenium 4 style. On Selenium 3, pass executable_path=... to webdriver.Chrome instead.
    service = Service(executable_path='C:/Users/lori/Desktop/python52project/chromedriver_win32/chromedriver.exe')
    driver = webdriver.Chrome(service=service)
    return driver


def get_web(url):
    # Load the page, then scroll to the bottom so that every post is rendered.
    print('Opening the target page')
    driver.get(url)
    time.sleep(7)
    scroll_down()
    time.sleep(5)


def scroll_down():
    # Press END repeatedly to scroll to the very bottom of the page.
    html_page = driver.find_element(By.TAG_NAME, 'html')
    for i in range(7):
        print(i)
        html_page.send_keys(Keys.END)
        time.sleep(1)


def get_data():
    print('Finding and extracting data')
    card_sel = 'div.WB_cardwrap.WB_feed_type'
    time_sel = "a.S_txt2[node-type='feed_list_item_date']"
    source_sel = "a.S_txt2[suda-uatrack='key=profile_feed&value=pubfrom_guest']"
    content_sel = 'div.WB_text.W_f14'
    interact_sel = 'span.line.S_line1>span>em:nth-child(2)'
    cards = driver.find_elements(By.CSS_SELECTOR, card_sel)
    info_list = []
    for card in cards:
        # A card may contain two time elements; the first one is the right one.
        time_ele = card.find_elements(By.CSS_SELECTOR, time_sel)[0]
        post_time = time_ele.text  # not named `time`, which would shadow the time module
        link = time_ele.get_attribute('href')
        # The source element is not always present.
        source_eles = card.find_elements(By.CSS_SELECTOR, source_sel)
        source = source_eles[0].text if source_eles else ''
        content = card.find_elements(By.CSS_SELECTOR, content_sel)[0].text
        counts = card.find_elements(By.CSS_SELECTOR, interact_sel)
        trans = counts[1].text
        comment = counts[2].text
        like = counts[3].text
        info_list.append([post_time, source, content, link, trans, comment, like])
    return info_list


def save_csv(info_list, csv_name):
    csv_path = './' + csv_name
    print('Writing the CSV file')
    # newline='' avoids blank lines between rows; encoding='utf-8-sig' writes a
    # BOM, so Chinese text is saved and reopened without problems.
    if os.path.exists(csv_path):
        with open(csv_path, 'a', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerows(info_list)
    else:
        with open(csv_path, 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerow(['Posted at', 'Source', 'Content', 'Link', 'Reposts', 'Comments', 'Likes'])
            writer.writerows(info_list)
    time.sleep(5)


def next_page_url():
    next_page_sel = 'a.page.next'
    next_page_ele = driver.find_elements(By.CSS_SELECTOR, next_page_sel)
    if next_page_ele:
        return next_page_ele[0].get_attribute('href')
    return None


driver = start_chrome()
input('Please log in to weibo.com in Chrome, then press Enter')  # pause so you can log in manually

while True:
    get_web(weibo_url)
    info_list = get_data()
    save_csv(info_list, csv_name)
    next_url = next_page_url()
    if next_url:
        weibo_url = next_url
    else:
        print('Scraping finished')
        break
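The "write the header once, then append" pattern used by save_csv can be checked in isolation, without a browser. This is a small sketch under stated assumptions rather than a replacement for the function above: `append_rows`, `read_rows`, the shortened `HEADER`, and the temporary file name are all hypothetical and exist only for the demonstration.

```python
import csv
import os
import tempfile

HEADER = ["time", "source", "content"]  # shortened, hypothetical header


def append_rows(csv_path, rows):
    """Append rows to csv_path, writing the header only when the file is new."""
    is_new_file = not os.path.exists(csv_path)
    # newline="" lets the csv module control line endings (no blank rows);
    # utf-8-sig puts a byte order mark at the start of a newly created file.
    with open(csv_path, "a", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(csv_path):
    """Read every row back, letting utf-8-sig strip the leading byte order mark."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return list(csv.reader(f))


# Simulate two pages being saved one after the other.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_weibo.csv")
append_rows(demo_path, [["10:00", "iPhone", "第一條"]])
append_rows(demo_path, [["11:00", "Android", "第二條"]])
print(read_rows(demo_path))
```

Opening in append mode for both cases removes the need for separate branches, since mode "a" creates the file when it does not exist yet.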

