Today we'll build a crawler for QQ Music that downloads the songs on a chart.
[Figure: QQ Music home page]
[Figure: chart page contents]
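A quick look at the page shows that it is rendered dynamically, so we need to capture the network requests that carry the data we want. QQ Music updates its site from time to time, so the rules differ from one release to the next; the code here is written against the rules in effect at the time of writing. Here is a shortcut: the requests that carry song data almost all contain the keyword fcg, so you can filter on it directly, or check each request's Preview tab and judge for yourself.
This is the request that carries the chart's song information, and it is the entry point for our crawler. From it we can get each song's basic information, including the song id (songmid) and name. We'll need these later, so keep them in mind while we work through the analysis step by step.
Next, open the player page and capture the request for the song's media file; filtering on the Media type is enough.
Compare the download links of different songs and the similarities and differences stand out. Capturing the requests for several songs shows that parameters such as guid and format are fixed values. The only parts that change are the string after C400 (which turns out to be the songmid) and the vkey value.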
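The comparison described above can also be done in code rather than by eye. Here is a minimal sketch using only the standard library; the two links are made up for illustration (the songmid and vkey values are fake, and the `uin` and `fromtag` parameter names are assumptions, not taken from a real capture).

```python
from urllib.parse import urlsplit, parse_qs


def diff_links(url_a, url_b):
    """Report which parts differ between two download links."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    # A parameter counts as changed if its value differs or it is missing on one side
    changed = {k for k in qa.keys() | qb.keys() if qa.get(k) != qb.get(k)}
    return {"path_differs": a.path != b.path, "changed_params": sorted(changed)}


# Hypothetical links: only the part after C400 and the vkey differ
link_a = "http://dl.stream.qqmusic.qq.com/C400AAAA.m4a?guid=953482270&vkey=KEY1&uin=0&fromtag=66"
link_b = "http://dl.stream.qqmusic.qq.com/C400BBBB.m4a?guid=953482270&vkey=KEY2&uin=0&fromtag=66"

print(diff_links(link_a, link_b))
```

For these two sample links the path differs and `vkey` is the only query parameter that changes, which matches the observation above.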
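Next we capture the request related to vkey; its name alone makes it fairly clear that it has to do with vkey.
This is the vkey response. Tidy up the data (for example, paste it into an online JSON viewer) and compare it with the download link, and it is easy to see where the vkey is stored. The purl field is exactly the part of the download link from C400 onward, with the vkey already attached, which saves us a lot of trouble.
Now look at the real request URL shown under Headers for the vkey request. Its query parameters are essentially the ones in the data parameter at the end, and the only thing that differs between songs is the songmid inside data. So all we need to do is build the URL with the right songmid and fetch it.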
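Instead of pasting the percent-encoded URL by hand, the same request can be assembled from a dictionary, which makes it easier to see where the songmid goes. The sketch below mirrors the parameters that appear in the captured URL used in this article; `build_vkey_url` is a helper name made up for illustration, and whether the server still accepts this exact request is an assumption that has not been checked here.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs


def build_vkey_url(songmid, guid="953482270"):
    """Build the vkey request URL for one songmid (hypothetical helper)."""
    data = {
        "req": {
            "module": "CDN.SrfCdnDispatchServer",
            "method": "GetCdnDispatch",
            "param": {"guid": guid, "calltype": 0, "userip": ""},
        },
        "req_0": {
            "module": "vkey.GetVkeyServer",
            "method": "CgiGetVkey",
            "param": {
                "guid": guid,
                "songmid": [songmid],  # the only value that changes per song
                "songtype": [0],
                "uin": "0",
                "loginflag": 1,
                "platform": "20",
            },
        },
        "comm": {"uin": 0, "format": "json", "ct": 24, "cv": 0},
    }
    query = {
        "-": "getplaysongvkey05137740976859173",
        "g_tk": "5381",
        "loginUin": "0",
        "hostUin": "0",
        "format": "json",
        "inCharset": "utf8",
        "outCharset": "utf-8",
        "notice": "0",
        "platform": "yqq.json",
        "needNewCode": "0",
        # Compact JSON, percent-encoded by urlencode
        "data": json.dumps(data, separators=(",", ":")),
    }
    return "https://u.y.qq.com/cgi-bin/musicu.fcg?" + urlencode(query)


# Round-trip check with a fake songmid
url = build_vkey_url("FAKE_SONGMID")
decoded = json.loads(parse_qs(urlsplit(url).query)["data"][0])
print(decoded["req_0"]["param"]["songmid"])
```

Building the query this way leaves the encoding to `urlencode`, so a stray character in a hand-edited URL cannot silently break the request.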
def getVkey(songmid):
    # Request the vkey data for one song; only the songmid inside the data parameter changes
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    # purl is the download path from C400 onward, with the vkey already attached
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey
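The lookup `res02["req_0"]["data"]["midurlinfo"][0]["purl"]` raises an exception if any level of the response is missing. Here is a slightly more defensive sketch of the same lookup; `extract_purl` is a name made up for illustration, and the sample payload is invented to match only the fields this article relies on.

```python
def extract_purl(payload):
    """Return the purl string, or None when it is missing or empty."""
    try:
        purl = payload["req_0"]["data"]["midurlinfo"][0]["purl"]
    except (KeyError, IndexError, TypeError):
        return None
    return purl or None


# Invented sample responses; the purl value is fake
ok = {"req_0": {"data": {"midurlinfo": [{"purl": "C400XXXX.m4a?guid=953482270&vkey=FAKE"}]}}}
empty = {"req_0": {"data": {"midurlinfo": [{"purl": ""}]}}}
broken = {"req_0": {"data": {}}}

for sample in (ok, empty, broken):
    print(extract_purl(sample))
```

Returning `None` lets the caller skip a song cleanly instead of building a download link with nothing after the host name.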
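Taking any song's songmid and vkey to verify, we find that the song can be downloaded. That completes the whole flow, which is basically:
- Get the song's songmid
- Use the songmid to get the vkey
- Download the song through the link built from the vkey
Code implementation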
#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
@author: maya
@contact: 1278077260@qq.com
@software: Pycharm
@file: music.py
@time: 2019/1/8 12:48
@desc:
'''
import json
import requests
import time
import os
import urllib.request
headers = {
"cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
"user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
}
def getHtml(start_url):
    # Fetch a URL and parse the body as JSON; return "" if anything goes wrong
    try:
        r = requests.get(start_url, headers=headers)
        r.encoding = r.apparent_encoding
        text = json.loads(r.text)
        return text
    except Exception:
        return ""
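def getSongMid(html):
    # Collect [songmid, songname] pairs; a failed or empty response yields an empty list
    songmid = []
    if not html:
        return songmid
    for tid in html.get('songlist', []):
        songmid.append([tid['data']['songmid'], tid['data']['songname']])
    return songmid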
def getSong(html):
    start_index = 0
    while True:
        start_num = start_index * 30  # offset of the first song on this page (30 per page)
        num = 30
        start_index += 1
        # Some update_key values look like 2018-5, but the request needs 2018-05, so convert them
        update_key = html['update_time']
        temp_key = update_key.split("_")
        if len(temp_key) == 3:
            if len(temp_key[1]) == 1:
                update_key = temp_key[0] + '_0' + temp_key[1] + temp_key[2]
            elif len(temp_key[2]) == 1:
                update_key = temp_key[0] + temp_key[1] + '_0' + temp_key[2]
        page_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(
            update_key, start_num)
        json_text = getHtml(page_url)
        songinfo = getSongMid(json_text)
        if len(songinfo) == 0:
            break
        for sid in songinfo:
            vkey = getVkey(sid[0])  # get the vkey (purl) for each song
            saveMusic(sid[0], vkey, sid[1])  # save the song
            time.sleep(1)  # sleep 1 second so the server does not filter us out
def getVkey(songmid):
    # Request the vkey data for one song; only the songmid inside the data parameter changes
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    # purl is the download path from C400 onward, with the vkey already attached
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey
def saveMusic(songmid, vkey, name):
    # Use a copy so the Host header does not leak into later requests to other hosts
    dl_headers = dict(headers, Host='dl.stream.qqmusic.qq.com')
    url = "http://dl.stream.qqmusic.qq.com/" + vkey
    res = requests.get(url, headers=dl_headers, stream=True)
    # Strip characters that are not allowed in file names
    safe_name = name.replace("?", "").replace("/", "_").replace("\\", "_").replace("\"", "")
    filename = 'song/{0}.m4a'.format(safe_name)
    print("***** Downloading *****")
    print(url)
    print("***** Song: {}".format(safe_name))
    with open(filename, 'wb') as f:
        f.write(res.raw.read())
    # Content-Length is a string (or missing), so convert it before comparing
    if int(urllib.request.urlopen(url).getheader('Content-Length') or 0) > 0:
        print("Downloaded song: {}".format(safe_name))
        # size = urllib.request.urlopen(url).getheader('Content-Length')
        # print(size)
    else:
        print("Download failed")
        os.remove(filename)
if __name__ == '__main__':
    start_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date=2019-01-08&topid=4&type=top&song_begin=0&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0"
    text = getHtml(start_url)
    getSong(text)
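The script cleans song names with a chain of `replace` calls that covers four characters. If you run it on Windows, other characters such as `:` or `*` can also make `open` fail. A minimal sketch of a broader cleanup follows; `safe_filename` is a name made up for illustration and is not part of the script above.

```python
import re


def safe_filename(name, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"


print(safe_filename('What? A/B: "Live"'))
print(safe_filename("   "))
```

Unlike the original chain, this replaces `?` and `"` with the replacement character instead of deleting them, so swap in your own rule if you prefer the original behavior.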
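Multithreaded version (the threading code is commented out, so as written the pages are fetched one after another):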
import requests
import json
import time
from datetime import datetime
import threading
date_time=datetime.now().date()
def func(num):
    # Fetch page `num` of the chart (30 songs per page) and download every song on it
    starturl = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1285181755&loginUin=2521763805&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(date_time, num * 30)
    print(starturl)
    headers = {
"cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
"user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
}
    res = requests.get(url=starturl, headers=headers)
    res = res.text
    res = json.loads(res)
    songname = []
    songmid = []
    for i in res["songlist"]:
        songname.append(i["data"]["songname"])
        songmid.append(i["data"]["songmid"])
    # Map each songmid to its song name
    mid_name = dict(zip(songmid, songname))
    for j in mid_name:
        vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(j)
        res02 = requests.get(url=vkey_url)
        time.sleep(0.5)
        res02 = res02.text
        res02 = json.loads(res02)
        vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
        url = "http://dl.stream.qqmusic.qq.com/" + vkey
        try:
            filename = "music/" + mid_name[j] + ".m4a"
            print(filename)
            res03 = requests.get(url=url, headers=headers)
            with open(filename, "wb") as f:
                f.write(res03.content)
        except Exception:
            # Skip songs that cannot be fetched or written
            continue
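# threading_list = []
# for the in range(4):
#     # Pass the function and its argument separately; func(the) would run it right away
#     threadParse = threading.Thread(target=func, args=(the,))
#     threading_list.append(threadParse)
#
# for th in threading_list:
#     th.start()
# for th in threading_list:
#     th.join()
for lon in range(4):
    func(lon)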
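To actually fetch the four pages in parallel, one option is a thread pool from the standard library. This is a sketch rather than the author's method: `run_pages` is a made-up helper, and `fake_worker` is a stand-in so that the example runs without touching the network. In the real script you would pass `func` instead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pages(worker, page_count=4, max_workers=4):
    """Call worker(page) for each page index on a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map returns results in page order, whichever thread finishes first
        return list(pool.map(worker, range(page_count)))


def fake_worker(page):
    # Hypothetical stand-in for func(page): returns the song offset of that page
    return page * 30


print(run_pages(fake_worker))
```

Leaving the `with` block waits for every worker to finish, so no daemon flags or manual `join` calls are needed. Keep the `time.sleep` calls inside the worker when hitting the real site, since parallel requests make it more likely that the server throttles you.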
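- urllib is used here to check the song data and discard songs that cannot be downloaded (because of permission restrictions and similar issues)
- The code does not create the output folders (song/ and music/); you can modify it to do so, or simply create the folders yourself
- For more crawler code, see GitHub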
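For the missing-folder note above, a small helper is enough. This is a minimal sketch; `ensure_dir` is a made-up name, and the demo uses a temporary directory so that it does not touch your working folder.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents) if needed, then return it."""
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a temporary location; in the scripts you would call
# ensure_dir("song") or ensure_dir("music") once before downloading
base = tempfile.mkdtemp()
song_dir = ensure_dir(os.path.join(base, "song"))
print(os.path.isdir(song_dir))
```

With `exist_ok=True` the call is safe to repeat, so it can sit at the top of the script without checking whether the folder already exists.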