1 Answer
This code returns the source of the news articles along with their metadata.
# Download and decompress one WARC file from Common Crawl first:
# wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
# gunzip CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
# pip install warc3-wet
import warc

# The original counter ran from -10 up to 1, which prints the first 12 records.
MAX_RECORDS = 12

with warc.open("CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc") as f:
    for count, record in enumerate(f):
        if count >= MAX_RECORDS:
            break
        # payload.read() returns the raw record body; the remaining fields are metadata.
        # from_response and write_to are methods, not metadata, so they are not printed.
        print(record.payload.read(), record.date, record.header, record.ip_address, record.offset, record.type, record.url)
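For records of type `response`, the payload printed above is normally a full HTTP response: a status line, headers, a blank line, and then the page source. Below is a minimal sketch of separating the HTML from the HTTP headers using only the standard library. The function name `split_http_payload` and the sample bytes are made up for illustration and are not part of the `warc` package.

```python
def split_http_payload(payload: bytes):
    """Split a raw HTTP response into (status_line, headers, body).

    Hypothetical helper for illustration; assumes the payload starts with
    an HTTP status line followed by headers and a blank line.
    """
    # Headers and body are separated by the first blank line.
    head, sep, body = payload.partition(b"\r\n\r\n")
    if not sep:
        # Fall back to bare LF line endings.
        head, sep, body = payload.partition(b"\n\n")

    lines = head.decode("iso-8859-1").splitlines()
    status_line = lines[0] if lines else ""

    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:
            # Header names are case-insensitive, so store them lowercased.
            headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up sample standing in for record.payload.read()
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
status, headers, html = split_http_payload(sample)
print(status)
print(headers["content-type"])
print(html.decode("utf-8"))
```

Inside the loop, one would presumably call this only when `record.type == "response"`, since other record types (such as `warcinfo` or `request`) do not carry an HTML body.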