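飘云阁: Scraping Meizitu with Python Selenium - Powered by Discuz! Archiver

zyjsuper posted on 2020-8-22 21:29:14

Scraping Meizitu with Python Selenium

Last edited by zyjsuper on 2020-8-22 21:38

Selenium's speed really leaves a lot to be desired. If anyone can share a strategy for running this concurrently, I would be very grateful.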

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

'''
@Author: zyjsuper
@License : (C) Copyright 2013-2020
@Contact : [email protected]
@File : MztSpider.py
@Time : 2020/8/17 20:23
@Desc :
'''

import os

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_pic(page):
    # Start one headless Chrome instance and reuse it for the whole page
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_options)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0',
        'Referer': 'https://www.mzitu.com/**/'
    }

    username = os.getenv("USERNAME")
    savepath = "C:\\Users\\" + username + "\\Desktop\\Meizitu"
    pagepath = savepath + "\\page-" + str(page)
    # Create both folders; exist_ok avoids a crash when the script is re-run
    os.makedirs(pagepath, exist_ok=True)

    try:
        # Collect the gallery links from the listing page
        browser.get("https://www.mzitu.com/page/" + str(page))
        links = browser.find_elements(By.XPATH, "//ul[@id='pins']/li/a")
        urls = [link.get_attribute("href") for link in links]

        for url in urls:
            # Reuse the same browser instead of launching a new one per gallery
            browser.get(url)
            pic_url = browser.find_element(By.XPATH, "//div[@class='main-image']//p//a//img").get_attribute("src")
            name = str(pic_url).split('/')[-1]
            response = requests.get(pic_url, headers=headers)
            print("Fetched image %s from %s." % (name, pic_url))
            with open(pagepath + "\\" + name, "wb") as file:
                file.write(response.content)
    finally:
        # Always close the browser, even if a page fails to load
        browser.quit()

if __name__ == '__main__':
    for p in range(1, 255):             # pages 1 through 254
        get_pic(p)

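ccwuax posted on 2020-8-23 23:52:10

If you want speed, you naturally need multithreading or multiprocessing, but that easily gets your IP banned. Then you need proxies and ways around the anti-scraping measures. It is a long road.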
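
A minimal sketch of the multithreading idea, assuming the image URLs have already been collected and only the downloads run in parallel. The names `download_all`, `fetch`, `max_workers` and `delay` are made up for illustration and are not part of the original script; the small pool size and optional delay are one crude way to lower the risk of a ban, not a guarantee.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def download_all(urls, fetch, max_workers=4, delay=0.0):
    """Run fetch(url) for every URL on a small thread pool.

    Returns a dict mapping each url to its result, or to the
    exception that fetch raised for that url.
    """
    def task(url):
        if delay:
            time.sleep(delay)  # crude politeness delay before each request
        return fetch(url)

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed download should not stop the rest
                results[url] = exc
    return results
```

With the script above, `fetch` could be something like `lambda u: requests.get(u, headers=headers).content`, with the returned bytes written to disk afterwards.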

superroshan posted on 2020-8-24 10:26:40

Pyppeteer is a somewhat nicer framework to use than Selenium.
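
For comparison, a rough sketch of how the link-collection step might look with Pyppeteer. It assumes `pyppeteer` is installed and that the listing page still uses the `ul#pins` markup targeted by the script above; `list_gallery_links` is a hypothetical helper name.

```python
import asyncio

async def list_gallery_links(page_number):
    """Return the gallery URLs found on one listing page."""
    # Imported here so the sketch can be loaded without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto("https://www.mzitu.com/page/" + str(page_number))
        # CSS equivalent of the XPath //ul[@id='pins']/li/a used with Selenium
        return await page.querySelectorAllEval(
            "ul#pins > li > a", "(nodes) => nodes.map(n => n.href)"
        )
    finally:
        # Close the browser whether or not the page loaded
        await browser.close()
```

Calling `asyncio.run(list_gallery_links(1))` would return the links for the first page. Because the function is a coroutine, several pages could be fetched together with `asyncio.gather`.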

zyjsuper posted on 2020-8-24 19:39:27

superroshan posted on 2020-8-24 10:26
Pyppeteer is a somewhat nicer framework to use than Selenium.

Thanks for the reply.

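zyjsuper posted on 2020-8-24 19:40:09

ccwuax posted on 2020-8-23 23:52
If you want speed, you naturally need multithreading or multiprocessing, but that easily gets your IP banned. Then you need proxies and ways around the anti-scraping measures. It is a long road.

Thanks for the reply.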

shrover1 posted on 2020-8-25 13:55:44

I don't really follow it, but it sounds impressive!!
