
Reading checkbox values from a PDF converted from Word

Java

有只小跳蛙 2023-06-21 13:27:59

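I have a PDF file that was converted from Word (Save As PDF). The Word document contains several checked/unchecked checkboxes. In the converted PDF they look like checkboxes, but they are neither real checkboxes (form fields) nor images. I need to read the value of each checkbox (checked/unchecked), but I have not been able to. I am trying PDFBox. I assumed the checkboxes were images and tried extracting all images from the PDF, but the things that look like checkboxes are not images. I would like to know how these checkboxes are stored in the PDF and how to read their values. Suggest any API and I will try it. Thanks, Daya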


1 Answer

暮色呼如


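When you convert a Word document containing Word form fields to PDF (using Save As *.pdf), unfortunately no PDF form fields are created from them (that would have been neat). The checkboxes are stored as characters in the MS Gothic font. So if you want to extract them, you need to extract the text of the PDF. A checkbox can have two states, so there are two characters: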

☐ - Unicode U+2610 (unchecked)

☒ - Unicode U+2612 (checked)

Some sample code:

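import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Written against the PDFBox 2.x API.
public class CheckboxReader {
    public static void main(String[] args) throws IOException {
        // Load your PDF; reading the path from the command line is just one option.
        InputStream pdfIs = new FileInputStream(args[0]);
        RandomAccessBufferedFileInputStream rbfi = new RandomAccessBufferedFileInputStream(pdfIs);
        PDFParser parser = new PDFParser(rbfi);
        parser.parse();
        // Closing the COSDocument releases the underlying resources.
        try (COSDocument cosDoc = parser.getDocument()) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            PDDocument pdDoc = new PDDocument(cosDoc);
            String parsedText = pdfStripper.getText(pdDoc);
            for (int i = 0; i < parsedText.length(); i++) {
                char c = parsedText.charAt(i);
                if ('☒' == c) {
                    System.out.println("Found a checked box at index " + i);
                    // Print the code point in \uXXXX form.
                    System.out.println("\\u" + Integer.toHexString(c | 0x10000).substring(1));
                } else if ('☐' == c) {
                    System.out.println("Found an unchecked box at index " + i);
                    System.out.println("\\u" + Integer.toHexString(c | 0x10000).substring(1));
                }
                // Any other character is skipped.
            }
        }
    }
}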
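Once the text has been extracted, it is usually more useful to know which label each checkbox belongs to than its character index. The following is a minimal sketch under some assumptions: the label is taken to be the text that follows the glyph on the same line, and the class name `CheckboxTextScanner`, its `scan` method, and the sample form text are all hypothetical (they are not part of PDFBox or of the original PDF). It works on a plain `String`, so it can be fed the `parsedText` from the text stripper.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (not part of PDFBox): pairs each checkbox glyph with the text after it. */
class CheckboxTextScanner {

    /** One checkbox found in the extracted text. */
    static final class Box {
        final boolean checked;
        final String label;

        Box(boolean checked, String label) {
            this.checked = checked;
            this.label = label;
        }
    }

    static final char UNCHECKED = '\u2610'; // ballot box
    static final char CHECKED = '\u2612';   // ballot box with X

    /** Scans line by line; a label runs up to the next checkbox glyph or the end of the line. */
    static List<Box> scan(String text) {
        List<Box> boxes = new ArrayList<>();
        for (String line : text.split("\\R")) {
            int i = 0;
            while (i < line.length()) {
                char c = line.charAt(i);
                if (c == CHECKED || c == UNCHECKED) {
                    int end = i + 1;
                    while (end < line.length()
                            && line.charAt(end) != CHECKED
                            && line.charAt(end) != UNCHECKED) {
                        end++;
                    }
                    boxes.add(new Box(c == CHECKED, line.substring(i + 1, end).trim()));
                    i = end; // continue at the next glyph (or the end of the line)
                } else {
                    i++;
                }
            }
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Made-up form text standing in for the output of the text stripper.
        String sample = "Gender: \u2612 Male \u2610 Female\n\u2610 Subscribe to newsletter";
        for (Box b : scan(sample)) {
            System.out.println((b.checked ? "[x] " : "[ ] ") + b.label);
        }
    }
}
```

Whether the label really follows the glyph depends on the layout of the Word document, so the pairing rule may need adjusting (for example, taking the text before the glyph instead).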

Update:

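You provided a sample PDF. In that file the checkboxes are stored as "drawings": vector path instructions inside the page's content stream, not characters. If you look at the page object, its Contents entry points you in the right direction. The page is 3 0 obj, which starts with: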

/Type /Page
/Contents 4 0 R
...

The content stream is 4 0 obj:

4 0 obj
/Length 807
stream
/P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
/F1 9.96 Tf
1 0 0 1 72.024 710.62 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[( )] TJ
EMC q
0.000018243 0 612 792 re
W* n
/P <</MCID 1>> BDC 0.72 w
0 G
1 j
73.104 696.34 9.24 9.24 re
0.48 w
72.984 705.7 m
82.464 696.22 l
82.464 705.7 m
72.984 696.22 l
EMC /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
/F1 9.96 Tf
1 0 0 1 83.544 697.3 Tm
0 g
0 G
[( )] TJ

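This is basically how the checkbox is drawn: a `re` operator draws the square, and two `m`/`l` (move-to/line-to) pairs draw the diagonals of the cross. You can read this with PDFBox, but you have to interpret/recognize it yourself. See the PDF specification for how these drawing instructions are interpreted...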
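One way to "recognize it yourself" is a simple heuristic over the path operators: collect every small, roughly square `re` rectangle, then count the `m`/`l` line segments that run diagonally across it. The sketch below is only an illustration under stated assumptions. It takes the already decoded content stream as a `String` (obtaining that text from the PDF is left out), splits it naively on whitespace rather than using a real PDF tokenizer, ignores transformation matrices, and the class name `DrawnCheckboxFinder`, the 30-point size limit, and the "two diagonals means checked" rule are all made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative heuristic (not a PDFBox API): finds drawn checkboxes in decoded content-stream text. */
class DrawnCheckboxFinder {

    static final class Box {
        final double x, y, w, h;
        final boolean checked;

        Box(double x, double y, double w, double h, boolean checked) {
            this.x = x; this.y = y; this.w = w; this.h = h; this.checked = checked;
        }
    }

    static final double MAX_SIDE = 30.0; // assumed upper bound for a checkbox side, in points

    static List<Box> find(String contentStream, double tolerance) {
        List<double[]> rects = new ArrayList<>(); // {x, y, w, h}
        List<double[]> lines = new ArrayList<>(); // {x1, y1, x2, y2}
        Deque<Double> operands = new ArrayDeque<>();
        double curX = 0, curY = 0;
        for (String token : contentStream.trim().split("\\s+")) {
            if (token.matches("[-+]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.push(Double.parseDouble(token));
                continue;
            }
            // Anything that is not a number is treated as an operator.
            if (token.equals("re") && operands.size() >= 4) {
                double h = operands.pop(), w = operands.pop(), y = operands.pop(), x = operands.pop();
                rects.add(new double[] {x, y, w, h});
            } else if (token.equals("m") && operands.size() >= 2) {
                curY = operands.pop();
                curX = operands.pop();
            } else if (token.equals("l") && operands.size() >= 2) {
                double y = operands.pop(), x = operands.pop();
                lines.add(new double[] {curX, curY, x, y});
                curX = x;
                curY = y;
            }
            operands.clear(); // operands belong to exactly one operator
        }
        List<Box> boxes = new ArrayList<>();
        for (double[] r : rects) {
            boolean squareAndSmall = Math.abs(r[2] - r[3]) <= tolerance && r[2] <= MAX_SIDE;
            if (!squareAndSmall) {
                continue; // e.g. the page-sized clipping rectangles
            }
            int diagonals = 0;
            for (double[] l : lines) {
                boolean spansBox = Math.abs(l[2] - l[0]) > r[2] / 2 && Math.abs(l[3] - l[1]) > r[3] / 2;
                if (spansBox && near(r, l[0], l[1], tolerance) && near(r, l[2], l[3], tolerance)) {
                    diagonals++;
                }
            }
            boxes.add(new Box(r[0], r[1], r[2], r[3], diagonals >= 2));
        }
        return boxes;
    }

    /** True if the point lies inside the rectangle, allowing for the given tolerance. */
    static boolean near(double[] r, double px, double py, double tol) {
        return px >= r[0] - tol && px <= r[0] + r[2] + tol
            && py >= r[1] - tol && py <= r[1] + r[3] + tol;
    }

    public static void main(String[] args) {
        // Excerpt of the content stream quoted in the answer.
        String stream = "0.000018243 0 612 792 re W* n /P <</MCID 1>> BDC 0.72 w 0 G 1 j "
                + "73.104 696.34 9.24 9.24 re 0.48 w 72.984 705.7 m 82.464 696.22 l "
                + "82.464 705.7 m 72.984 696.22 l EMC";
        for (Box b : find(stream, 0.5)) {
            System.out.println("checkbox at (" + b.x + ", " + b.y + ") checked=" + b.checked);
        }
    }
}
```

On the excerpt above, the page-sized rectangle is skipped and the 9.24 x 9.24 square is reported as checked, because both diagonals start and end within 0.5 points of its corners. A box drawn without the two lines would come back as unchecked. Real files may draw the mark differently (a tick, a filled shape), so the rule would need tuning per document.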

Answered 2023-06-21
