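Parse annotations from a PDF

I would like a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler ( https://code.launchpad.net/~poppler-python/poppler-python/trunk ) but I cannot figure out how to get it to give me anything useful.

I found the get_annot_mapping method and modified the provided demo program to call it via self.current_page.get_annot_mapping(), but I have no idea what to do with an AnnotMapping object. It does not seem to be fully implemented, as it only provides the copy method.

If any other library provides this functionality, that is fine as well.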

Accepted answer

It turns out the bindings were incomplete. This is now fixed. https://bugs.launchpad.net/poppler-python/+bug/397850


Answers:

Just in case someone is looking for some working code, here is the script I use.

import poppler
import sys
import urllib  # Python 2; on Python 3 pathname2url lives in urllib.request
import os

def main():
    input_filename = sys.argv[1]
    # the document is opened from a file:// URI, see http://blog.hartwork.org/?p=612
    document = poppler.document_new_from_file(
        'file://%s' % urllib.pathname2url(os.path.abspath(input_filename)),
        None)
    n_pages = document.get_n_pages()
    all_annots = 0

    for i in range(n_pages):
        page = document.get_page(i)
        for annot_mapping in page.get_annot_mapping():
            annot = annot_mapping.annot
            # skip link annotations, only report the remaining types
            if annot.get_annot_type().value_name != 'POPPLER_ANNOT_LINK':
                all_annots += 1
                print('page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(
                    i + 1, annot.get_modified(),
                    annot.get_annot_type().value_nick, annot.get_contents()))

    if all_annots > 0:
        print(str(all_annots) + " annotation(s) found")
    else:
        print("no annotations found")

if __name__ == "__main__":
    main()

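You should definitely have a look at PyPDF2. This library has a lot of potential: you can extract almost anything from a PDF, including images or comments. Start by examining what Acrobat Reader DC (Reader) can tell you about a PDF's comments. Take a simple PDF and annotate it (add some comments) with Reader. Then, in the Comments tab in the upper right corner, click the three horizontal dots, click Export All To Data File... and select the format with the extension xfdf. This creates an XML file that you can parse. The format is very transparent and self-explanatory.

If, however, you cannot rely on a user clicking through this and need to extract the same data from the PDF programmatically with Python, do not despair, there is a solution. (Inspired by Extract images from PDF without resampling, in python?)

Prerequisites

pip install PyPDF2

XFDF XML

What Reader gives you in the XFDF file mentioned above looks like this: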

<?xml version="1.0" ?>
<xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
    <annots>
        <caret IT="Replace" color="#0000FF" creationdate="D:20190221151519+01'00'" date="D:20190221151526+01'00'" flags="print" fringe="1.069520,1.069520,1.069520,1.069520" name="72f8d1b7-d878-4281-bd33-3a6fb4578673" page="0" rect="636.942000,476.891000,652.693000,489.725000" subject="Inserted Text" title="Admin">
            <contents-richtext>
                <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                    <p dir="ltr">
                        <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"> comment1</span>
                    </p>
                </body>
            </contents-richtext>
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,374.656000,941.008000,488.656000"/>
        </caret>
        <highlight color="#FFD100" coords="183.867000,402.332000,220.968000,402.332000,183.867000,387.587000,220.968000,387.587000" creationdate="D:20190221151441+01'00'" date="D:20190221151448+01'00'" flags="print" name="a18c7fb0-0af3-435e-8c32-1af2af3c46ea" opacity="0.399994" page="0" rect="179.930000,387.126000,224.904000,402.793000" subject="Highlight" title="Admin">
            <contents-richtext>
                <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                    <p dir="ltr">
                        <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span>
                    </p>
                </body>
            </contents-richtext>
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,288.332000,941.008000,402.332000"/>
        </highlight>
        <caret color="#0000FF" creationdate="D:20190221151452+01'00'" date="D:20190221151452+01'00'" flags="print" fringe="0.828156,0.828156,0.828156,0.828156" name="6bf0226e-a3fb-49bf-bc89-05bb671e1627" page="0" rect="285.877000,372.978000,298.073000,382.916000" subject="Inserted Text" title="Admin">
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,268.088000,941.008000,382.088000"/>
        </caret>
        <strikeout IT="StrikeOutTextEdit" color="#0000FF" coords="588.088000,497.406000,644.818000,497.406000,588.088000,477.960000,644.818000,477.960000" creationdate="D:20190221151519+01'00'" date="D:20190221151519+01'00'" flags="print" inreplyto="72f8d1b7-d878-4281-bd33-3a6fb4578673" name="6686b852-3924-4252-af21-c1b10390841f" page="0" rect="582.290000,476.745000,650.616000,498.621000" replyType="group" subject="Cross-Out" title="Admin">
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,383.406000,941.008000,497.406000"/>
        </strikeout>
    </annots>
    <f href="p1.pdf"/>
    <ids modified="ABB10FA107DAAA47822FB5D311112349" original="474F087D87E7E544F6DEB9E0A93ADFB2"/>
</xfdf>

The various types of annotation are presented here as tags inside the <annots> block.
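As a rough illustration of how such an export could be parsed, here is a minimal sketch that uses only the standard library. The function name xfdf_annotations is made up for this example, and the sketch assumes the file follows the layout shown above: annotation elements directly under <annots>, with the comment text inside <contents-richtext>.

```python
import xml.etree.ElementTree as ET

# Namespace of the XFDF elements, taken from the xmlns attribute shown above.
XFDF_NS = "{http://ns.adobe.com/xfdf/}"


def xfdf_annotations(xfdf_text):
    """Return a list of (kind, page, text) tuples from an XFDF string.

    Hypothetical helper: it assumes the structure of the sample export,
    not every XFDF file a PDF tool might produce.
    """
    root = ET.fromstring(xfdf_text)
    annots = root.find(XFDF_NS + "annots")
    results = []
    if annots is None:
        return results
    for element in annots:
        kind = element.tag.replace(XFDF_NS, "")   # e.g. "highlight", "caret"
        page = int(element.get("page", "0"))       # zero-based page index
        rich = element.find(XFDF_NS + "contents-richtext")
        if rich is None:
            text = ""
        else:
            # itertext() yields only the text nodes of the nested XHTML;
            # split/join collapses the indentation whitespace around them.
            text = " ".join("".join(rich.itertext()).split())
        results.append((kind, page, text))
    return results
```

Annotations without a comment, such as the second caret in the sample, come back with an empty string as their text.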

Using PyPDF2

Python can give you almost the same data. To obtain it, have a look at the output of the following script:

from PyPDF2 import PdfFileReader  # newer PyPDF2 releases name this class PdfReader

reader = PdfFileReader("/path/to/my/file.pdf")

for page in reader.pages:
    # pages without annotations have no /Annots entry
    if "/Annots" in page:
        for annot in page["/Annots"]:
            print(annot.getObject())       # (1)  get_object() in newer releases
            print("")

For the same file as in the XFDF example above, the output looks like this:

{'/Popup': IndirectObject(192, 0), '/M': u"D:20190221151448+01'00'", '/CreationDate': u"D:20190221151441+01'00'", '/NM': u'a18c7fb0-0af3-435e-8c32-1af2af3c46ea', '/F': 4, '/C': [1, 0.81961, 0], '/Rect': [179.93, 387.126, 224.904, 402.793], '/Type': '/Annot', '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u'otrasneho', '/QuadPoints': [183.867, 402.332, 220.968, 402.332, 183.867, 387.587, 220.968, 387.587], '/Subj': u'Highlight', '/CA': 0.39999, '/AP': {'/N': IndirectObject(202, 0)}, '/Subtype': '/Highlight'}

{'/Parent': IndirectObject(191, 0), '/Rect': [737.008, 288.332, 941.008, 402.332], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A425D0>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(194, 0), '/M': u"D:20190221151452+01'00'", '/CreationDate': u"D:20190221151452+01'00'", '/NM': u'6bf0226e-a3fb-49bf-bc89-05bb671e1627', '/F': 4, '/C': [0, 0, 1], '/Subj': u'Inserted Text', '/Rect': [285.877, 372.978, 298.073, 382.916], '/Type': '/Annot', '/P': IndirectObject(5, 0), '/AP': {'/N': IndirectObject(201, 0)}, '/RD': [0.82816, 0.82816, 0.82816, 0.82816], '/T': u'Admin', '/Subtype': '/Caret'}

{'/Parent': IndirectObject(193, 0), '/Rect': [737.008, 268.088, 941.008, 382.088], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42830>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(196, 0), '/M': u"D:20190221151519+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'6686b852-3924-4252-af21-c1b10390841f', '/F': 4, '/IRT': IndirectObject(197, 0), '/C': [0, 0, 1], '/Rect': [582.29, 476.745, 650.616, 498.621], '/Type': '/Annot', '/T': u'Admin', '/P': IndirectObject(5, 0), '/QuadPoints': [588.088, 497.406, 644.818, 497.406, 588.088, 477.96, 644.818, 477.96], '/Subj': u'Cross-Out', '/IT': '/StrikeOutTextEdit', '/AP': {'/N': IndirectObject(200, 0)}, '/RT': '/Group', '/Subtype': '/StrikeOut'}

{'/Parent': IndirectObject(195, 0), '/Rect': [737.008, 383.406, 941.008, 497.406], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AF0>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(198, 0), '/M': u"D:20190221151526+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'72f8d1b7-d878-4281-bd33-3a6fb4578673', '/F': 4, '/C': [0, 0, 1], '/Rect': [636.942, 476.891, 652.693, 489.725], '/Type': '/Annot', '/RD': [1.06952, 1.06952, 1.06952, 1.06952], '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment1</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u' pica', '/Subj': u'Inserted Text', '/IT': '/Replace', '/AP': {'/N': IndirectObject(212, 0)}, '/Subtype': '/Caret'}

{'/Parent': IndirectObject(197, 0), '/Rect': [737.008, 374.656, 941.008, 488.656], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AB0>, '/Subtype': '/Popup'}
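To get from dictionaries like these to the plain list of comment texts the question asks for, something along the following lines might work. This is only a sketch: annotation_texts is a made-up name, and it assumes each annotation has already been resolved into a dict-like object keyed by PDF names, as in the printed output.

```python
def annotation_texts(annotations):
    """Collect the /Contents text of every non-popup annotation.

    `annotations` is any iterable of dict-like objects keyed by PDF
    names (an assumption of this sketch, matching the output above).
    """
    texts = []
    for annot in annotations:
        # In the output above, /Popup entries only carry /Parent, /Rect
        # and /Open, so they are skipped here.
        if annot.get("/Subtype") == "/Popup":
            continue
        contents = annot.get("/Contents")
        if contents:
            texts.append(str(contents).strip())
    return texts
```

With the script above, the input could be built per page as (a.getObject() for a in page["/Annots"]). Annotations that carry no /Contents key, like the plain caret in the sample, are simply left out.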

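If you examine the output, you will see that the two are more or less the same. Every annotation in the XFDF file has two counterparts in the PyPDF2 output: the annotation itself and its /Popup. The /C attribute is the colour of the highlight in RGB, scaled to floats in the range <0, 1>. /Rect defines the bounding box of the annotation on the page/spread, in points (1/72 of an inch) relative to the lower left corner of the page, with values increasing to the right and up. /M and /CreationDate are the modification and creation times, /QuadPoints is an array of [x1, y1, x2, y2, ..., xn, yn] coordinates of a line around the annotation, /Subj, /Type, /Subtype and /IT identify the type of the annotation, /T is probably the creator, and /RC is the XHTML representation of the annotation text, if there is one. An ink-drawn annotation is presented here with the attribute /InkList, holding data in the form [[L1x1, L1y1, L1x2, L1y2, ..., L1xn, L1yn], [L2x1, L2y1, ..., L2xn, L2yn], ..., [Lmx1, Lmy1, ..., Lmxn, Lmyn]] for line 1, line 2, ..., line m.

For a more thorough explanation of the various fields you get from getObject() on the line marked (1) in the Python code above, please consult https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf , especially section 12.5 Annotations, on pages 381-413.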
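Some of these fields are easier to work with after a little decoding. The following is a minimal sketch using only the standard library. The helper names parse_pdf_date, rgb_to_hex and quad_bounding_boxes are invented for this example, and the date parser assumes the full D:YYYYMMDDHHmmSS form seen in the output above (shorter, truncated dates are not handled).

```python
import re
from datetime import datetime, timedelta, timezone

# Full-precision PDF date: D:YYYYMMDDHHmmSS, then optionally Z or +HH'mm' / -HH'mm'.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?"
)


def parse_pdf_date(value):
    """Turn an /M or /CreationDate string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % (value,))
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zulu, sign, off_hours, off_minutes = match.groups()[6:]
    tz = None  # stays naive when the string carries no offset
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


def rgb_to_hex(color):
    """Convert a /C array of floats in the range 0..1 to '#RRGGBB'."""
    return "#" + "".join("%02X" % round(float(c) * 255) for c in color)


def quad_bounding_boxes(quad_points):
    """Group /QuadPoints into runs of 8 numbers and return one
    (x_min, y_min, x_max, y_max) box per quadrilateral."""
    values = [float(v) for v in quad_points]
    boxes = []
    for i in range(0, len(values) - 7, 8):
        xs = values[i:i + 8:2]        # x coordinates of the four corners
        ys = values[i + 1:i + 8:2]    # y coordinates of the four corners
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Applied to the highlight above, rgb_to_hex([1, 0.81961, 0]) gives the same #FFD100 that the XFDF export shows, and the box computed from /QuadPoints lies inside the annotation's /Rect.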

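The pdf-annots script can extract annotations from PDFs. It is built on PDFMiner.six and produces Markdown output for both the highlighted text and any annotations made on it, such as comments on highlighted areas or popup boxes. The output looks similar to this: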

 * Page 2 Highlight:
 > Underlying text that was highlighted

 Comment made on highlighted text.

 * Page 3 Highlight: "Short highlighted text" -- Short comment.

 * Page 4 Text: A note on the page.

The full command-line options are shown below.

usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                    [--print-filename] [-w COLS]
                    INFILE [INFILE ...]

Extracts annotations from a PDF file in markdown format for use in reviewing.

positional arguments:
  INFILE                PDF files to process

optional arguments:
  -h, --help            show this help message and exit

Basic options:
  -p, --progress        emit progress information
  -o OUTFILE            output file (default is stdout)
  -n COLS, --cols COLS  number of columns per page in the document (default: 2)

Options controlling output format:
  -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                        sections to emit (default: highlights, comments, nits)
  --no-group            emit annotations in order, don't group into sections
  --print-filename      print the filename when it has annotations
  -w COLS, --wrap COLS  wrap text at this many output columns

I haven't tried it extensively, but it has been working well so far!

Here is a working example (ported from a previous answer) that extracts annotations with the Python module popplerqt5. Run it as: python3 extract.py sample.pdf

import popplerqt5
import argparse

def extract(fn):
    doc = popplerqt5.Poppler.Document.load(fn)
    annotations = []
    for i in range(doc.numPages()):
        page = doc.page(i)
        for annot in page.annotations():
            contents = annot.contents()
            if contents:
                annotations.append(contents)
                print(f'page={i + 1} {contents}')

    print(f'{len(annotations)} annotation(s) found')
    return annotations

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('fn')
    args = parser.parse_args()
    extract(args.fn)

Someone asked a similar question. I tried the code sample there, and it did not work for me until I made some functional and cosmetic changes.

#!/usr/bin/ruby

require 'pdf-reader'

ARGV.each do |filename|
  PDF::Reader.open(filename) do |reader|
    puts "file: #{filename}"
    puts "page\tcomment"
    reader.pages.each do |page|
      annots_ref = page.attributes[:Annots]
      if annots_ref
        actual_annots = annots_ref.map { |a| reader.objects[a] }
        actual_annots.each do |actual_annot|
          unless actual_annot[:Contents].nil?
            puts "#{page.number}\t#{actual_annot[:Contents]}"
          end
        end
      end
    end       
  end
end

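If saved as pdfannot.rb, chmod +x'ed and put in your favourite PATH directory, usage is:

./pdfannot.rb <path>

This is my first time writing/editing/remixing Ruby code, so I am very open to suggestions. HTH.

As a side note, finding this question earlier would have saved me from doing the work twice. Hopefully this question gets more attention in the future, so it is easier to find.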

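I have never used this, nor wanted this kind of feature, but I found PDFMiner - this link has information about basic usage. Maybe this is what you are looking for?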

PyMuPDF's author @jorjmckie wrote a snippet for me, which I modified a bit:

import fitz  # to import the PyMuPDF library
# from pprint import pprint

def _parse_highlight(annot: fitz.Annot, wordlist: list) -> str:
    points = annot.vertices  # four corner points per highlighted quad
    quad_count = len(points) // 4
    sentences = []
    for i in range(quad_count):
        r = fitz.Quad(points[i * 4: i * 4 + 4]).rect
        # keep the words whose bounding box intersects this quad
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(' '.join(w[4] for w in words))
    return ' '.join(sentences)

def main() -> dict:
    doc = fitz.open('path/to/your/file')
    page = doc[0]

    # list of words on page (getText in older PyMuPDF releases)
    wordlist = page.get_text("words")
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x

    highlights = {}
    annot = page.first_annot  # firstAnnot in older PyMuPDF releases
    i = 0
    while annot:
        if annot.type[0] == 8:  # 8 is the highlight annotation type
            highlights[i] = _parse_highlight(annot, wordlist)
            print('> ' + highlights[i] + '\n')
            i += 1  # increment after printing so highlights[i] exists
        annot = annot.next

    # pprint(highlights)
    return highlights

if __name__ == "__main__":
    main()

There are still some small typos in the result, though:

> system upsets,

> expansion of smart grid monitoring devices that generally provide nodal voltages and power injections at fine spatial resolution,

> hurricanes to indi- vidual lightning strikes),
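One of the glitches visible above is a word that was hyphenated across a line break ("indi- vidual"). A possible clean-up step, sketched here with only the standard library, is to join such fragments after extraction. The function name join_hyphenated is invented for this example, and the rule is a heuristic: it also merges genuine constructions such as "pre- and post-", so treat it as a starting point rather than a complete fix.

```python
import re

# A hyphen followed by a space, sitting between two lowercase letters,
# is treated as a soft line-break hyphen left over from the PDF layout.
_SOFT_BREAK = re.compile(r"(?<=[a-z])- (?=[a-z])")


def join_hyphenated(text):
    """Rejoin words that were split as 'indi- vidual' (heuristic sketch)."""
    return _SOFT_BREAK.sub("", text)
```

Ordinary hyphenated words such as "well-known" are left alone because no space follows the hyphen.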

The pdfannots package can also be called from Python through process_file:

from typing import Dict, List

from pdfannots import process_file

def get_pdf_annots(pdf_filename) -> Dict[int, List[str]]:
    """
    Return example:
    {
        0: ["Human3.6M", "Our method"],
        3: [
            "pretrained using 3D mocap data"
        ],
    }
    """
    annots_dict = dict()
    document = process_file(open(pdf_filename, "rb"))
    for page_idx in range(len(document.pages)):
        annots = document.pages[page_idx].annots
        for annot in annots:
            if page_idx not in annots_dict:
                annots_dict[page_idx] = []

            text = "".join(annot.text).strip()
            # remove line breaks: rejoin words hyphenated across lines, then use spaces
            text = text.replace("-\n", "").replace("\n", " ")
            annots_dict[page_idx].append(text)
    return annots_dict

if __name__ == "__main__":
    print(get_pdf_annots("xxx.pdf"))