Scraping Baidu Baike summaries of 5A scenic spots and segmenting them into words

Tags: big data

1. Programming environment

Operating system: Windows 10
Language: Python 3.6
Word segmentation tool: jieba (结巴分词)

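2. Project layout

[Figure 1.png: project directory listing]

baike_spider.py scrapes the summaries and stores them in the scenic_spots directory;
cut_word.py performs the word segmentation and stores its results in cut_word_result;
scenic_spots_5A.txt lists the names of the scenic spots to scrape, as follows: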

北京故宫
天坛公园
颐和园
八达岭
慕田峪长城
明十三陵
恭王府
北京奥林匹克公园

Note that the scenic_spots and cut_word_result folders do not need to be created in advance; the programs create them automatically when they run.
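The automatic creation of the output folders can be sketched as follows. This is a minimal version that assumes the folder should be emptied on every run, which is what both scripts in this article do; the helper name `reset_dir` is hypothetical and does not appear in the original code.

```python
import os
import shutil


def reset_dir(path):
    """Delete `path` if it exists, then create it empty (hypothetical helper)."""
    if os.path.isdir(path):
        # ignore_errors=True mirrors the shutil.rmtree(path, True) call in the scripts
        shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path)
    return path
```

Calling `reset_dir("scenic_spots")` before scraping discards results from earlier runs, so stale files never mix with new ones.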

3. Scraping the summaries

Code in baike_spider.py:

import os
import time
import codecs
import shutil

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Raw string, so the backslashes in the Windows path are not read as escapes.
# This script uses the Selenium 3 API (executable_path, find_element_by_xpath).
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")


def getInfoBox(spotname, filename):
    try:
        print(filename)
        # The file is closed automatically, even if scraping fails.
        with codecs.open(filename, 'w', 'utf-8') as info:
            driver.get("http://baike.baidu.com/")
            elem_input = driver.find_element_by_xpath("//form[@id='searchForm']/input")
            time.sleep(2)
            # Names are read from a file, so they end with a newline
            # (the name on the last line may not have one).
            spotname = spotname.rstrip('\n')
            elem_input.send_keys(spotname)
            elem_input.send_keys(Keys.RETURN)

            # codecs.open works in binary mode and does not translate '\n',
            # so the Windows line ending '\r\n' is written explicitly.
            info.write(spotname + '\r\n')
            print(driver.current_url)
            print(driver.title)

            elem_value = driver.find_elements_by_xpath("//div[@class='lemma-summary']/div")
            for value in elem_value:
                print(value.text)
                info.write(value.text + '\r\n')
            time.sleep(2)
    except Exception as e:
        print("Error: ", e)


def main():
    # Recreate the output directory
    path = "scenic_spots\\"
    if os.path.isdir(path):
        shutil.rmtree(path, True)
    os.makedirs(path)

    source = open("scenic_spots_5A.txt", 'r')
    num = 1
    for scenicspot in source:
        name = "%03d" % num
        fileName = path + str(name) + ".txt"
        getInfoBox(scenicspot, fileName)
        num += 1
    print('End Read Files!')
    time.sleep(10)

    source.close()
    driver.close()


if __name__ == '__main__':
    main()
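The script above relies on Selenium 3 calls (`executable_path=`, `find_element_by_xpath`), which recent Selenium 4 releases no longer provide. A hedged sketch of the same lookup with the Selenium 4 locator API follows. The helper names `format_summary` and `fetch_summary` are hypothetical, the XPaths are copied from the script above and may no longer match the current Baidu Baike page, and the driver is assumed to be created by the caller.

```python
import time

SEARCH_INPUT_XPATH = "//form[@id='searchForm']/input"   # copied from the original script
SUMMARY_XPATH = "//div[@class='lemma-summary']/div"     # copied from the original script


def format_summary(spotname, paragraphs):
    """Build the file content: the spot name, then one paragraph per line."""
    lines = [spotname.rstrip('\r\n')] + list(paragraphs)
    return ''.join(line + '\r\n' for line in lines)


def fetch_summary(driver, spotname):
    """Search Baidu Baike for `spotname` and return the summary paragraphs."""
    # Imported lazily so this module can be loaded without selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver.get("http://baike.baidu.com/")
    elem_input = driver.find_element(By.XPATH, SEARCH_INPUT_XPATH)
    elem_input.send_keys(spotname.rstrip('\r\n'))
    elem_input.send_keys(Keys.RETURN)
    time.sleep(2)  # crude wait for the result page, as in the original script
    return [e.text for e in driver.find_elements(By.XPATH, SUMMARY_XPATH)]
```

A caller would create the driver (for example with `webdriver.Chrome()`), call `fetch_summary(driver, name)`, and write `format_summary(name, paragraphs)` to the output file. Keeping the formatting separate from the browser calls makes that part checkable without a browser.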

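Result:
Eight txt files are generated in the scenic_spots directory, each holding the summary of one scenic spot.

[Figures 2.png and 3.png: screenshots of the result]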
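Each output file is numbered by the position of its scenic spot in scenic_spots_5A.txt. A minimal sketch of that mapping is shown here, assuming blank lines should be skipped (the original loop does not skip them); the helper name `numbered_files` is hypothetical.

```python
import os


def numbered_files(names, out_dir="scenic_spots"):
    """Pair each non-blank spot name with a zero-padded path (001.txt, 002.txt, ...)."""
    pairs = []
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # assumption: blank lines are ignored
        index = len(pairs) + 1
        pairs.append((name, os.path.join(out_dir, "%03d.txt" % index)))
    return pairs
```

Passing the open file object for scenic_spots_5A.txt as `names` gives the same 001.txt to 008.txt numbering as the script, provided the file has no blank lines.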

4. Word segmentation with jieba

Code in cut_word.py:

import codecs
import os
import shutil

import jieba


def read_file_cut():
    # Recreate the output directory
    path = "scenic_spots\\"
    respath = "cut_word_result\\"
    if os.path.isdir(respath):
        shutil.rmtree(respath, True)
    os.makedirs(respath)

    # The 8 input files are named 001.txt ... 008.txt
    for num in range(1, 9):
        name = "%03d" % num
        fileName = path + name + ".txt"
        resName = respath + name + ".txt"
        with open(fileName, 'r', encoding='utf-8') as source, \
                codecs.open(resName, 'w', encoding='utf-8') as result:
            for line in source:
                # Strip the line ending from every line, not only the first one
                line = line.rstrip('\r\n')
                seglist = jieba.cut(line, cut_all=False)  # accurate mode
                output = ' '.join(seglist)                # join tokens with spaces
                print(output)
                result.write(output + '\r\n')
        print('End file: ' + str(num))
    print('End All')


if __name__ == '__main__':
    read_file_cut()
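The per-line segmentation step can be isolated into a small function. This is only a sketch: `segment_lines` is a hypothetical name, and the tokenizer is passed in as the `cut` argument so that the function can be tried without jieba installed.

```python
def segment_lines(lines, cut):
    """Segment every line with `cut` and join the tokens with single spaces."""
    results = []
    for line in lines:
        text = line.rstrip('\r\n')                     # drop the line ending first
        tokens = [t for t in cut(text) if t.strip()]   # skip whitespace-only tokens
        results.append(' '.join(tokens))
    return results
```

With jieba, the accurate mode used in this article corresponds to `segment_lines(lines, lambda s: jieba.cut(s, cut_all=False))`.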

Result:
Eight files are generated in the cut_word_result directory, each holding the segmented text:

[Figures 4.png and 5.png: screenshots of the result]
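The script hard-codes eight input files. If the list of scenic spots changes, the inputs could instead be discovered from the directory. A minimal sketch, assuming every `.txt` file in the input directory should be processed; the helper name `list_input_files` is hypothetical.

```python
import os


def list_input_files(in_dir):
    """Return the .txt files in `in_dir`, sorted by name (001.txt, 002.txt, ...)."""
    return sorted(
        os.path.join(in_dir, f)
        for f in os.listdir(in_dir)
        if f.endswith(".txt")
    )
```

Because the file names are zero-padded, sorting them as strings keeps them in the same order as the scenic spots in scenic_spots_5A.txt.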

Author: 海天一树X
Link: https://www.jianshu.com/p/fa9e30dce662
