PDF library for Python. Which library should I choose?!

zaytceva · 14.July.2014 10:55:02

Good day.
I'm looking for a library for working with PDF files.
I need to parse tables that are contained in a PDF file. The table format is roughly as follows:

________________________________________________________________________
|Object | Message  | Description                   | Solution          |
________________________________________________________________________
|Human  | Headache | head is in pain               | drink some pills  |
|       |______________________________________________________________|
|       | Sadness  | Don't want to do a thing      | visit friends,    |
|       |          | never smiles. Looks sad       | have a vacation   |

I would be grateful for usage examples of the functions I might need.
I'm a beginner, so I apologize if the question is vague or elementary.

polusok · 14.July.2014 20:27:32

Well, from what I remember, there are these:

pdfminer http://www.unixuser.org/~euske/python/pdfminer/index.html
pyPdf / PyPDF2 (PyPDF2 is available on PyPI)

But I haven't used them in a long time, so you'll need to check whether they work in your case.
Explore what they can do, and if you hit a dead end, I'll help again.
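
To give an idea of what "exploring the functionality" might look like for the table above, here is a minimal sketch. It assumes pdfminer.six (the maintained fork of PDFMiner) and that you already know roughly where each column starts on the page. The function names `lines_to_cells`, `merge_rows` and `pdf_table_rows`, the `column_starts` values and the tolerance are all made up for illustration; only `extract_pages`, `LTTextContainer` and `LTTextLine` come from pdfminer.six. The merging rules (an empty Message cell means "continuation of the previous row", an empty Object cell means "same object as above") are assumptions read off the sample table, not something the library does for you.

```python
from bisect import bisect_right


def lines_to_cells(fragments, column_starts, y_tolerance=2.0):
    """Group (x0, y, text) fragments into visual lines, one cell per column.

    PDF coordinates grow upward, so a larger y means higher on the page.
    """
    lines = []  # each entry: [y of the line, list of cell strings]
    for x0, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        text = text.strip()
        if not text:
            continue
        if not lines or abs(lines[-1][0] - y) > y_tolerance:
            lines.append([y, [""] * len(column_starts)])
        # The column is the last one whose left edge is at or before x0.
        col = max(bisect_right(column_starts, x0) - 1, 0)
        cells = lines[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in lines]


def merge_rows(lines, key_col=1, inherit_col=0):
    """Fold wrapped lines into logical rows (assumed layout, see above)."""
    rows = []
    for cells in lines:
        if rows and not cells[key_col]:
            # No Message text: treat the line as a wrapped continuation.
            prev = rows[-1]
            for i, text in enumerate(cells):
                if text:
                    prev[i] = (prev[i] + " " + text).strip()
        else:
            row = list(cells)
            if rows and not row[inherit_col]:
                # Empty Object cell: the merged cell from above still applies.
                row[inherit_col] = rows[-1][inherit_col]
            rows.append(row)
    return rows


def pdf_table_rows(path, column_starts):
    """Read text lines with coordinates from a PDF; needs pdfminer.six."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    rows = []
    for page in extract_pages(path):
        fragments = [
            (line.x0, line.y1, line.get_text())
            for box in page if isinstance(box, LTTextContainer)
            for line in box if isinstance(line, LTTextLine)
        ]
        rows.extend(merge_rows(lines_to_cells(fragments, column_starts)))
    return rows
```

In practice the layout analysis may glue neighbouring cells into a single text line when the gap between columns is small, so it is worth printing the raw fragments first and picking `column_starts` from the x coordinates you actually see.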

And here are a couple of links on the topic to start with

polusok · 13.November.2014 18:43:16

I found a good article on this: http://www.binpress.com/tutorial/manipulating-pdfs-with-python/167. I've put together an excerpt of the libraries and the links to them:

pdfrw : Last update: 2012. Read and write PDF files; watermarking, copying images from one PDF to another. Includes sample code. Python 2.5–2.7. MIT License. https://code.google.com/p/pdfrw/
slate : Active development. Simplifies extracting text from PDF files. Wrapper around PDFMiner. Includes documentation on GitHub and PyPI. Python 2.6. GPL License. GitHub: timClicks/slate
PDFQuery : Active development. PDF scraping with jQuery or XPath syntax. Requires PDFMiner, pyquery and lxml libraries. Includes sample code, documentation. Seems to be Python 2.x. MIT License. GitHub: jcushman/pdfquery
PDFMiner : Active development. Extracting text, images, object coordinates, metadata from PDF files. Pure Python. Includes sample code and command line interface; Google group and documentation. Python 2.x only. MIT License. GitHub: euske/pdfminer (the repository is now marked as not actively maintained and points to pdfminer.six)
PyPDF2 : Active development. Split, merge, crop, etc. of PDF files. Pure Python. Includes sample code and command line interface, documentation. Python 2 and 3. BSD License. GitHub: py-pdf/pypdf
reportlab : Python package. Create PDF documents as well as vector and bitmap images. See the ReportLab docs.
pdftk : GUI and command line. Merge, split PDF files, and more. See PDFtk, the PDF Toolkit.
fdfgen : Python package. Generates an FDF file containing form data that can be used with pdftk to populate a PDF form. GitHub: ccnmtl/fdfgen
qpdf : C++ library and program suite. Transforms PDF files. Useful for linearizing/optimizing, uncompressing, and encryption.
ghostscript : Interpreter for Postscript and PDF. http://www.ghostscript.com/
XPDF : Open source project. Contains several useful tools such as pdffonts and pdfinfo. See XpdfReader.
pdffonts : lists fonts used in a PDF file including information on font type, whether the font is embedded, etc. Part of the open-source Xpdf project. Licensed under GPL v2.
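
As an illustration of the "split, merge" use case from the PyPDF2 entry, here is a rough sketch that copies selected pages into a new file. It assumes a recent release where the package is imported as `pypdf` and exposes `PdfReader` / `PdfWriter` (with PyPDF2 3.x, import the same names from `PyPDF2` instead); older releases used different class names. The helper names `parse_page_ranges` and `copy_pages` and the "1-3,5" range syntax are invented for this example.

```python
def parse_page_ranges(spec, page_count):
    """Turn a spec like '1-3,5' (1-based, inclusive) into zero-based indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"bad page range {part!r} for {page_count} pages")
        indices.extend(range(start - 1, end))
    return indices


def copy_pages(src_path, dst_path, spec):
    """Write the pages selected by spec into a new PDF; needs pypdf."""
    from pypdf import PdfReader, PdfWriter  # assumption: pypdf is installed

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

A call such as `copy_pages("report.pdf", "excerpt.pdf", "1-3,5")` (hypothetical file names) would keep pages 1, 2, 3 and 5. Merging works the same way: add pages from several readers to one writer.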

I've also added this to the wishlist at http://lessons2.ru, to make a couple of practical lessons with exercises on the topic.

dgk · 02.October.2015 06:46:57

We tried several of the libraries mentioned here, such as reportlab; in recent projects we use http://weasyprint.org/, which generates PDF from HTML and is very convenient. Installation can take some fiddling (I no longer remember with what exactly), but Stack Overflow has the answers.
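
For anyone who wants to try the HTML-to-PDF route, a minimal sketch follows. It assumes WeasyPrint is installed and uses its `HTML(string=...).write_pdf(...)` call; the helper names `rows_to_html` and `write_pdf` and the table markup are only an illustration, not a recommended template.

```python
from html import escape


def rows_to_html(header, rows):
    """Build a small HTML document with one table; cell text is escaped."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return (
        "<html><body><table border='1'>"
        f"<thead><tr>{head}</tr></thead><tbody>{body}</tbody>"
        "</table></body></html>"
    )


def write_pdf(html_text, out_path):
    """Render the HTML string to a PDF file; needs WeasyPrint installed."""
    from weasyprint import HTML

    HTML(string=html_text).write_pdf(out_path)
```

Usage would be something like `write_pdf(rows_to_html(["Object", "Message"], [["Human", "Headache"]]), "table.pdf")`, where the output file name is arbitrary. Styling is then plain CSS, which is the main convenience compared with drawing the table by hand.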