Extracting all data from a PDF with Python pdfminer

哈士奇WWW 2021-03-06 11:09:09
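I am using pdfminer with Python to extract data from PDF files. I want to extract everything in the PDF, whether it is an image or text. Can this be done in one line (or two if necessary) without a lot of work? Any help is appreciated. Thanks in advance.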

3 Answers

繁花如伊

Contributed 2012 experience points, received more than 12 likes

For Python 3:

pip install pdfminer.six

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path, codec='utf-8'):
    # Collect the text of every page into an in-memory buffer.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # The with block closes the file even if a page fails to parse.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text
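
For the one-line case the question asks about, pdfminer.six also ships a `pdfminer.high_level` module. The sketch below assumes pdfminer.six is installed; `extract_all_text`, `save_images`, and the `extracted_images` output directory are hypothetical names chosen for illustration. Text comes from `extract_text`, and images are exported through `ImageWriter`, which may not handle every image encoding.

```python
from pathlib import Path


def extract_all_text(pdf_path):
    """Return the text of every page as one string (hypothetical helper)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")
    # Imported here so the missing-file check runs before pdfminer.six is needed.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def save_images(pdf_path, outdir="extracted_images"):
    """Write embedded images to outdir and return their file names (hypothetical helper)."""
    from pdfminer.high_level import extract_pages
    from pdfminer.image import ImageWriter
    from pdfminer.layout import LTContainer, LTImage

    writer = ImageWriter(outdir)  # creates outdir if it does not exist
    names = []

    def walk(element):
        # Images usually sit inside LTFigure containers, so recurse into containers.
        if isinstance(element, LTImage):
            names.append(writer.export_image(element))
        elif isinstance(element, LTContainer):
            for child in element:
                walk(child)

    for page_layout in extract_pages(str(pdf_path)):
        walk(page_layout)
    return names
```

With these, `text = extract_all_text('a.pdf')` covers the text in one call and `save_images('a.pdf')` covers the images in a second.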


Answered 2021-03-31
慕田峪4524236

Contributed 1875 experience points, received more than 5 likes

For Python 3 there is another option: pip install pdfminer3k

import time
from functools import wraps
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def fn_timer(function):
    # Decorator that prints how long the wrapped function takes to run.
    @wraps(function)
    def function_timer(*args, **kwargs):
        t0 = time.time()
        result = function(*args, **kwargs)
        t1 = time.time()
        print("Total time running %s: %s seconds" % (function.__name__, str(t1 - t0)))
        return result
    return function_timer


@fn_timer
def convert_pdf(path, pages):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # pages is passed through as the page selection for process_pdf.
    with open(path, 'rb') as fp:
        process_pdf(rsrcmgr, device, fp, pages)
    device.close()

    text = retstr.getvalue()
    retstr.close()
    return text


file = r'M:\a.pdf'

print(convert_pdf(file, [1]))
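
One thing to watch in the call above: as far as I can tell, pdfminer compares the page selection against a zero-based page index, so `[1]` would select the second page rather than the first. Assuming that holds for the installed version, a small converter from 1-based page numbers keeps call sites readable. `to_zero_based` is a hypothetical helper, not part of pdfminer.

```python
def to_zero_based(page_numbers):
    """Convert 1-based page numbers to the zero-based set assumed for pdfminer."""
    zero_based = set()
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        zero_based.add(number - 1)
    return zero_based


print(sorted(to_zero_based([1, 3])))  # [0, 2]
```

Under that assumption, `convert_pdf(file, to_zero_based([1]))` would read the first page.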


Answered 2021-03-31