Project overview: Many of you have probably used Sina Weibo, one of today's most popular social apps. Our goal is to collect each user's profile information, posts, comments, and publish times, plus daily trending topics, comment counts, like counts, and related metrics, to meet business needs. In the era of big data, whoever holds the data holds the advantage, so this post walks through how to scrape Sina Weibo.
First, set up the Python environment (Python 2.7 plus Scrapy + Selenium + PhantomJS + Chrome).
I. Installing Python 2.7 + Scrapy + Selenium + PhantomJS
The following example uses Python 2.7.9; other versions work the same way.
1. Download Python
wget https://www.python.org/ftp/python/2.7.9/Python-2.7.9.tgz
2. Extract, compile, and install (run the following five commands in order)
tar -zxvf Python-2.7.9.tgz
cd Python-2.7.9
./configure --prefix=/usr/local/python-2.7.9
make
make install
3. The system already ships with its own Python, so add a symlink for the newly installed version
ln -s /usr/local/python-2.7.9/bin/python /usr/bin/python2.7.9
4. To use this version, run "python2.7.9" followed by a space and the path to your .py script
python2.7.9 ~/helloworld.py
Installing Scrapy:
pip install scrapy
# if you need distributed crawling
pip install scrapy_redis
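If you do go the distributed route with scrapy_redis, the wiring happens in settings.py. The snippet below is only a minimal sketch and assumes a Redis instance is reachable at 127.0.0.1:6379; adjust the URL for your own setup.
# settings.py (sketch for scrapy_redis; the Redis address is an assumption)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # share the request queue via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # share the dedup fingerprints too
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = 'redis://127.0.0.1:6379'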
Installing Selenium:
pip install selenium
Installing PhantomJS:
Download PhantomJS into /usr/local/src/packet/ (the directory is up to you).
Operating system: CentOS 7 64-bit
After downloading, decompress the archive (it is a bz2 file, so first decompress it with bzip2 into a tar file, then unpack it with tar)
bzip2 -d phantomjs-2.1.1-linux-x86_64.tar.bz2
Then unpack the tar file into /usr/local/, install the dependency packages, and rename the directory:
tar xvf phantomjs-2.1.1-linux-x86_64.tar -C /usr/local/
yum -y install wget fontconfig
# rename the directory (so the phantomjs command is easier to use later)
mv /usr/local/phantomjs-2.1.1-linux-x86_64/ /usr/local/phantomjs
The last step is to create a symlink (this puts a phantomjs symlink in /usr/bin/; if you are not sure why /usr/bin/ matters, check your PATH with echo $PATH):
ln -s /usr/local/phantomjs/bin/phantomjs /usr/bin/
At this point the installation is complete. Test it (thanks to the symlink above, you can now invoke phantomjs like any other command):
[root@localhost ~]# phantomjs
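You can also verify that Selenium can drive the freshly installed PhantomJS. This is just a quick sanity-check sketch, assuming the phantomjs symlink created above is on PATH:
from selenium import webdriver

driver = webdriver.PhantomJS()      # picks up phantomjs from PATH via the symlink
driver.get('https://weibo.cn')
print driver.title                  # should print the page title if everything works
driver.quit()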
II. Installing Chrome
Note: to give the server a Chrome runtime that can scrape data together with the Selenium automation scripts, you need to configure Chrome's dependencies.
Running selenium + chromedriver on a server
1. Introduction
We want to use Selenium to scrape data from websites, but PhantomJS sometimes throws errors. Chrome now has a headless mode, so we no longer have to rely on PhantomJS.
However, some errors came up while installing Chrome on the server, so here is a summary of the whole installation process.
2. Installing Chrome on Ubuntu
# Install Google Chrome
# https://askubuntu.com/questions/79280/how-to-install-chrome-browser-properly-via-command-line
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb   # Might show "errors", fixed by next line
sudo apt-get install -f
Chrome should now be installed. Test it by running the following command:
google-chrome --headless --remote-debugging-port=9222 https://chromium.org --disable-gpu
This starts headless mode with remote debugging. Most Ubuntu servers have no GPU, so pass --disable-gpu to avoid errors.
Then open another SSH session to the server and query its local port 9222 from the command line:
curl http://localhost:9222
If the installation succeeded, you will see debugging information. In my case an error came up instead; the fix is described below.
Fixing a possible error
Running the command above may produce an error saying Chrome cannot be run as root. In that case, configure Chrome as follows.
(1) Locate the google-chrome file
On my machine it is in /opt/google/chrome/
(2) Open the google-chrome file with vi
vi /opt/google/chrome/google-chrome
Find the following line in the file:
exec -a "$0" "$HERE/chrome" "$@"
(3) Append --user-data-dir --no-sandbox to it
so the full shell command becomes
exec -a "$0" "$HERE/chrome" "$@" --user-data-dir --no-sandbox
(4) Relaunch google-chrome and it will run normally.
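If you would rather not patch the google-chrome wrapper script, the same flags can be passed from Selenium instead. A sketch, assuming chromedriver (installed in the next step) is already on PATH:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')       # works around the "cannot run as root" error
chrome_options.add_argument('--disable-gpu')
wd = webdriver.Chrome(chrome_options=chrome_options)
wd.get('https://weibo.cn')
print wd.title
wd.quit()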
3. Installing the Chrome driver (chromedriver)
Download chromedriver
chromedriver exposes an API for controlling Chrome; it is the bridge Selenium uses to drive the browser.
It is best to install the latest chromedriver. I initially installed an older version and got an error; the latest one works fine. You can find it at the address below:
https://sites.google.com/a/chromium.org/chromedriver/downloads
At the time of writing, the latest version is 2.37.
wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
At this point the headless Chrome setup on the server is complete.
4. Using headless Chrome
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'")
wd = webdriver.Chrome(chrome_options=chrome_options, executable_path='/home/chrome/chromedriver')
wd.get("https://www.163.com")
content = wd.page_source.encode('utf-8')
print content
wd.quit()
III. Scraping the data
To scrape Sina Weibo we need to simulate a login and save the cookies once the login succeeds. To avoid getting accounts banned, we use many Weibo accounts for scraping (how many depends on how much data you need).
1. First, simulate the login and collect cookies
#!/usr/bin/env python
# encoding: utf-8
import datetime
import json
import base64
from time import sleep
import pymongo
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

reload(sys)
sys.setdefaultencoding('utf-8')
Fill in your Weibo usernames and passwords. You can buy accounts on Taobao for roughly one yuan per seven. I recommend buying a few dozen: Weibo's anti-scraping measures are aggressive, and requesting too frequently triggers 302 redirects. Alternatively, you can increase the interval between requests.
WeiBoAccounts = [
    {'username': 'javzx61369@game.weibo.com', 'password': 'esdo77127'},
    {'username': 'v640e2@163.com', 'password': 'wy539067'},
    {'username': 'd3fj3l@163.com', 'password': 'af730743'},
    {'username': 'oia1xs@163.com', 'password': 'tw635958'},
]
'''
WeiBoAccounts = [{'username': 'your_username', 'password': 'your_password'}]
'''

cookies = []
client = pymongo.MongoClient("192.168.98.5", 27017)
db = client["Sina"]
userAccount = db["userAccount"]


def get_cookie_from_weibo(username, password):
    driver = webdriver.PhantomJS()
    driver.get('https://weibo.cn')
    print driver.title
    assert "微博" in driver.title
    login_link = driver.find_element_by_link_text('登录')
    ActionChains(driver).move_to_element(login_link).click().perform()
    login_name = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "loginName"))
    )
    login_password = driver.find_element_by_id("loginPassword")
    login_name.send_keys(username)
    login_password.send_keys(password)
    login_button = driver.find_element_by_id("loginAction")
    login_button.click()
    # pause 10 seconds to check whether the login succeeded; if not, log in manually
    sleep(10)
    cookie = driver.get_cookies()
    # print driver.page_source
    print driver.current_url
    driver.close()
    return cookie


def init_cookies():
    for cookie in userAccount.find():
        cookies.append(cookie['cookie'])


if __name__ == "__main__":
    try:
        userAccount.drop()
    except Exception as e:
        pass
    for account in WeiBoAccounts:
        cookie = get_cookie_from_weibo(account["username"], account["password"])
        userAccount.insert_one({"_id": account["username"], "cookie": cookie})
    init_cookies()
The code is straightforward: it simulates the login, collects the cookies, and inserts them into MongoDB so later requests can use them. The init_cookies() function is used later by the downloader middleware.
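After the script finishes you can quickly check that the cookies really landed in MongoDB. A small verification sketch (the server address matches the one used above):
import pymongo

client = pymongo.MongoClient("192.168.98.5", 27017)
for doc in client["Sina"]["userAccount"].find():
    print doc["_id"], len(doc["cookie"])   # account name and number of cookies saved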
The Scrapy items code is as follows:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class InformationItem(Item):
    """ Personal profile information """
    _id = Field()                 # user ID
    NickName = Field()            # nickname
    Gender = Field()              # gender
    Province = Field()            # province
    City = Field()                # city
    BriefIntroduction = Field()   # bio
    Birthday = Field()            # birthday
    Num_Tweets = Field()          # number of posts
    Num_Follows = Field()         # number of accounts followed
    Num_Fans = Field()            # number of fans
    SexOrientation = Field()      # sexual orientation
    Sentiment = Field()           # relationship status
    VIPlevel = Field()            # membership level
    Authentication = Field()      # verification
    URL = Field()                 # homepage URL


class TweetsItem(Item):
    """ Weibo post information """
    _id = Field()             # user ID - weibo ID
    ID = Field()              # user ID
    Content = Field()         # post content
    PubTime = Field()         # publish time
    Co_oridinates = Field()   # location coordinates
    Tools = Field()           # publishing tool/platform
    Like = Field()            # number of likes
    Comment = Field()         # number of comments
    Transfer = Field()        # number of reposts
    filepath = Field()


class RelationshipsItem(Item):
    """ User relationships; only the following relation is kept """
    fan_id = Field()
    followed_id = Field()     # ID of the followed user
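Scrapy items behave like dictionaries, which is how the spider and pipeline below use them. A quick sanity-check sketch, assuming the file above is saved as sina/items.py:
from sina.items import InformationItem

item = InformationItem()
item['_id'] = '6092234294'
item['NickName'] = u'test user'
print dict(item)        # prints only the fields that have been set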
2. The initial request configuration:
#!/usr/bin/env python
# encoding: utf-8
""" Initial queue of user IDs to crawl """
weiboID = [
    # "5303798085"
    # '6033587203'
    '6092234294'
]
3. The Scrapy spider code is as follows:
This code parses the fetched pages and extracts the data; please read it carefully.
# encoding: utf-8
import datetime
import requests
import re
from lxml import etree
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sina.config import weiboID
from sina.items import TweetsItem, InformationItem, RelationshipsItem
import time
import random


def rand_num():
    number = ""
    for i in range(5):
        number += str(random.randint(0, 9))
    return number


class SinaSpider(Spider):
    name = "SinaSpider"
    host = "https://weibo.cn"
    start_urls = list(set(weiboID))
    filepath = '/home/YuQing/content/'

    def start_requests(self):
        for uid in self.start_urls:
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)

    def parse_information(self, response):
        """ Scrape personal profile information """
        informationItem = InformationItem()
        selector = Selector(response)
        ID = re.findall('(\d+)/info', response.url)[0]
        print response.url, response.body
        try:
            text1 = ";".join(selector.xpath('body/div[@class="c"]//text()').extract())  # get all text() inside the tag
            nickname = re.findall('昵称;?[::]?(.*?);', text1)
            gender = re.findall('性别;?[::]?(.*?);', text1)
            place = re.findall('地区;?[::]?(.*?);', text1)
            briefIntroduction = re.findall('简介;?[::]?(.*?);', text1)
            birthday = re.findall('生日;?[::]?(.*?);', text1)
            sexOrientation = re.findall('性取向;?[::]?(.*?);', text1)
            sentiment = re.findall('感情状况;?[::]?(.*?);', text1)
            vipLevel = re.findall('会员等级;?[::]?(.*?);', text1)
            authentication = re.findall('认证;?[::]?(.*?);', text1)
            url = re.findall('互联网;?[::]?(.*?);', text1)

            informationItem["_id"] = ID
            if nickname and nickname[0]:
                informationItem["NickName"] = nickname[0].replace(u"\xa0", "")
            if gender and gender[0]:
                informationItem["Gender"] = gender[0].replace(u"\xa0", "")
            if place and place[0]:
                place = place[0].replace(u"\xa0", "").split(" ")
                informationItem["Province"] = place[0]
                if len(place) > 1:
                    informationItem["City"] = place[1]
            if briefIntroduction and briefIntroduction[0]:
                informationItem["BriefIntroduction"] = briefIntroduction[0].replace(u"\xa0", "")
            if birthday and birthday[0]:
                try:
                    birthday = datetime.datetime.strptime(birthday[0], "%Y-%m-%d")
                    informationItem["Birthday"] = birthday - datetime.timedelta(hours=8)
                except Exception:
                    informationItem['Birthday'] = birthday[0]  # may be a zodiac sign rather than a date
            if sexOrientation and sexOrientation[0]:
                if sexOrientation[0].replace(u"\xa0", "") == gender[0]:
                    informationItem["SexOrientation"] = "同性恋"
                else:
                    informationItem["SexOrientation"] = "异性恋"
            if sentiment and sentiment[0]:
                informationItem["Sentiment"] = sentiment[0].replace(u"\xa0", "")
            if vipLevel and vipLevel[0]:
                informationItem["VIPlevel"] = vipLevel[0].replace(u"\xa0", "")
            if authentication and authentication[0]:
                informationItem["Authentication"] = authentication[0].replace(u"\xa0", "")
            if url:
                informationItem["URL"] = url[0]

            try:
                urlothers = "https://weibo.cn/attgroup/opening?uid=%s" % ID
                new_ck = {}
                for ck in response.request.cookies:
                    new_ck[ck['name']] = ck['value']
                r = requests.get(urlothers, cookies=new_ck, timeout=5)
                if r.status_code == 200:
                    selector = etree.HTML(r.content)
                    texts = ";".join(selector.xpath('//body//div[@class="tip2"]/a//text()'))
                    print texts
                    if texts:
                        # num_tweets = re.findall(r'微博\[(\d+)\]', texts)
                        num_tweets = texts.split(';')[0].replace('微博[', '').replace(']', '')
                        # num_follows = re.findall(r'关注\[(\d+)\]', texts)
                        num_follows = texts.split(';')[1].replace('关注[', '').replace(']', '')
                        # num_fans = re.findall(r'粉丝\[(\d+)\]', texts)
                        num_fans = texts.split(';')[2].replace('粉丝[', '').replace(']', '')
                        if len(num_tweets) > 0:
                            informationItem["Num_Tweets"] = int(num_tweets)
                        if num_follows:
                            informationItem["Num_Follows"] = int(num_follows)
                        if num_fans:
                            informationItem["Num_Fans"] = int(num_fans)
            except Exception as e:
                print e
        except Exception as e:
            pass
        else:
            yield informationItem
            if informationItem["Num_Tweets"] and informationItem["Num_Tweets"] < 5000:
                yield Request(url="https://weibo.cn/%s/profile?filter=1&page=1" % ID,
                              callback=self.parse_tweets, dont_filter=True)
            if informationItem["Num_Follows"] and informationItem["Num_Follows"] < 500:
                yield Request(url="https://weibo.cn/%s/follow" % ID,
                              callback=self.parse_relationship, dont_filter=True)
            if informationItem["Num_Fans"] and informationItem["Num_Fans"] < 500:
                yield Request(url="https://weibo.cn/%s/fans" % ID,
                              callback=self.parse_relationship, dont_filter=True)

    def parse_tweets(self, response):
        """ Scrape the weibo posts """
        selector = Selector(response)
        ID = re.findall('(\d+)/profile', response.url)[0]
        divs = selector.xpath('body/div[@class="c" and @id]')
        for div in divs:
            try:
                tweetsItems = TweetsItem()
                id = div.xpath('@id').extract_first()                             # weibo ID
                content = div.xpath('div/span[@class="ctt"]//text()').extract()   # post content
                cooridinates = div.xpath('div/a/@href').extract()                 # location coordinates
                like = re.findall('赞\[(\d+)\]', div.extract())                   # like count
                transfer = re.findall('转发\[(\d+)\]', div.extract())             # repost count
                comment = re.findall('评论\[(\d+)\]', div.extract())              # comment count
                others = div.xpath('div/span[@class="ct"]/text()').extract()      # publish time and tool/platform

                tweetsItems["_id"] = ID + "-" + id
                tweetsItems["ID"] = ID
                if content:
                    tweetsItems["Content"] = " ".join(content).strip('[位置]')  # strip the trailing "[位置]"
                if cooridinates:
                    cooridinates = re.findall('center=([\d.,]+)', cooridinates[0])
                    if cooridinates:
                        tweetsItems["Co_oridinates"] = cooridinates[0]
                if like:
                    tweetsItems["Like"] = int(like[0])
                if transfer:
                    tweetsItems["Transfer"] = int(transfer[0])
                if comment:
                    tweetsItems["Comment"] = int(comment[0])
                if others:
                    others = others[0].split('来自')
                    tweetsItems["PubTime"] = others[0].replace(u"\xa0", "")
                    if len(others) == 2:
                        tweetsItems["Tools"] = others[1].replace(u"\xa0", "")
                filename = 'wb_' + time.strftime('%Y%m%d%H%M%S') + '_' + rand_num() + '.txt'
                tweetsItems["filepath"] = self.filepath + filename
                yield tweetsItems
            except Exception as e:
                print e
                self.logger.info(e)
                pass

        next_page = '下页'.decode('utf-8')
        url_next = selector.xpath(
            'body/div[@class="pa" and @id="pagelist"]/form/div/a[text()="%s"]/@href' % next_page).extract()
        if url_next:
            yield Request(url=self.host + url_next[0], callback=self.parse_tweets, dont_filter=True)

    def parse_relationship(self, response):
        """ Open the url and extract the user IDs on the page """
        selector = Selector(response)
        if "/follow" in response.url:
            ID = re.findall('(\d+)/follow', response.url)[0]
            flag = True
        else:
            ID = re.findall('(\d+)/fans', response.url)[0]
            flag = False
        he = "关注他".decode('utf-8')
        she = "关注她".decode('utf-8')
        urls = selector.xpath('//a[text()="%s" or text()="%s"]/@href' % (he, she)).extract()
        uids = re.findall('uid=(\d+)', ";".join(urls), re.S)
        for uid in uids:
            relationshipsItem = RelationshipsItem()
            relationshipsItem["fan_id"] = ID if flag else uid
            relationshipsItem["followed_id"] = uid if flag else ID
            yield relationshipsItem
            yield Request(url="https://weibo.cn/%s/info" % uid, callback=self.parse_information)

        next_page = '下页'.decode('utf-8')
        next_url = selector.xpath('//a[text()="%s"]/@href' % next_page).extract()
        if next_url:
            yield Request(url=self.host + next_url[0], callback=self.parse_relationship, dont_filter=True)
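With the project assembled, the spider is normally started with "scrapy crawl SinaSpider" from the project directory. If you prefer launching it from a Python script, here is a minimal sketch; it assumes you run it where the project's scrapy.cfg lives so the settings can be discovered:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the sina project's settings.py
process.crawl('SinaSpider')                       # spider is looked up by its name attribute
process.start()                                   # blocks until the crawl finishes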
4. The Scrapy middlewares code is as follows:
# encoding: utf-8
import random
from sina.cookies import cookies, init_cookies
from sina.user_agents import agents


class UserAgentMiddleware(object):
    """ Rotate the User-Agent """
    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent


class CookiesMiddleware(object):
    """ Rotate the cookies """
    def __init__(self):
        init_cookies()

    def process_request(self, request, spider):
        cookie = random.choice(cookies)
        request.cookies = cookie
The point here is that the cookies saved in MongoDB are attached to each request by Scrapy's downloader middleware, so every request carries a logged-in cookie when fetching data.
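For these two middlewares to take effect they must be registered in settings.py. A sketch, assuming the file above lives at sina/middlewares.py (the priority numbers are just reasonable defaults):
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'sina.middlewares.UserAgentMiddleware': 401,   # rotate User-Agent
    'sina.middlewares.CookiesMiddleware': 402,     # attach a random logged-in cookie
}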
5. Scrapy pipelines to save the data:
As you probably know, pipelines are where scraped items get cleaned and persisted.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from sina.items import RelationshipsItem, TweetsItem, InformationItem
import time
import random
import json


class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("your_server_ip", 27017)
        db = client["Sina"]
        self.Information = db["Information"]
        self.Tweets = db["Tweets"]
        self.Relationships = db["Relationships"]

    def process_item(self, item, spider):
        """ Check the item type, process it accordingly, then write it to the database """
        if isinstance(item, RelationshipsItem):
            try:
                self.Relationships.insert(dict(item))
            except Exception:
                pass
        elif isinstance(item, TweetsItem):
            try:
                self.Tweets.insert(dict(item))
                filename = item['filepath']
                lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
                with open(filename, 'w') as f:
                    f.write(lines)
            except Exception as e:
                print e
        elif isinstance(item, InformationItem):
            try:
                self.Information.insert(dict(item))
            except Exception:
                pass
        return item
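As the header comment says, the pipeline also has to be enabled in settings.py. A sketch, assuming the file is saved as sina/pipelines.py; DOWNLOAD_DELAY is optional but follows the earlier advice about slowing down to avoid 302 redirects:
# settings.py
ITEM_PIPELINES = {
    'sina.pipelines.MongoDBPipeline': 300,
}
DOWNLOAD_DELAY = 2      # seconds between requests; raise it if accounts start getting redirected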
6. For stable scraping we also build a pool of User-Agents and keep rotating them so requests appear to come from different browsers. The implementation is as follows:
#!/usr/bin/env python
# encoding: utf-8
""" User-Agents """
agents = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1"
]
The middlewares module loads this agents list, and in Scrapy's process_request() each outgoing request is sent with a randomly chosen User-Agent.
That covers just about everything. Thanks for reading! If you have questions, leave a comment and I will reply in detail. (This is my first Jianshu post, so please bear with me.) Follow-up posts will cover scraping Zhihu, Toutiao, and other sites.
Author: 可爱的小虫虫
Link: https://www.jianshu.com/p/1890e9b3ba37