Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to
write it all out manually.
Nope. PDF is an end-of-the-line format intended for screen reading or
printing. It does not contain any information about how the characters
or symbols got where they are, only about the position they are in on
the page, how big, what colour, etc. All the information about *why*
stuff is where it is, is omitted once it has been used to do the
formatting.
Ideally you need to go back to the author and ask them for a copy of the
original LaTeX document, but that isn't always possible.
1. It is possible to extract just the text (pdftotext is part of Xpdf,
see http://www.xpdfreader.com/) which comes out as plain text, one line
per paragraph, with a ^L (formfeed character) between pages. Mathematics
comes out as a jumble of unusable nonsense.
2. Apache PDFBox is a Java .jar utility to extract text from PDFs
(https://pdfbox.apache.org/download.cgi), and if you pick HTML output it
will preserve bold and italic as well as paragraphs. I have no idea what
it would do with math, probably the same as pdftotext.
I have used both of these and they are excellent for what they can do.
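For what it's worth, here is a rough Python sketch of handling the pdftotext output described in item 1: it splits the text on the formfeed character to get pages, and treats each non-empty line as a paragraph. The function names split_pages and extract_pages are my own, and the sketch assumes pdftotext is installed and on your PATH, and that it writes UTF-8. It does nothing useful for the math, which will still come out as a jumble.

```python
import subprocess

FORM_FEED = "\f"  # the ^L character that separates pages


def split_pages(text):
    """Split extracted text into pages, and each page into paragraphs.

    Assumes one paragraph per line and a formfeed between pages,
    as described for the pdftotext output above.
    """
    pages = []
    for raw_page in text.split(FORM_FEED):
        # Keep non-empty lines only; each one is treated as a paragraph.
        paragraphs = [line.strip() for line in raw_page.splitlines() if line.strip()]
        pages.append(paragraphs)
    # A trailing formfeed leaves one empty chunk at the end; drop it.
    if pages and not pages[-1]:
        pages.pop()
    return pages


def extract_pages(pdf_path):
    """Run pdftotext on pdf_path and return a list of pages of paragraphs."""
    # Passing "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return split_pages(result.stdout)
```

You would call extract_pages("paper.pdf") (a made-up filename) and get back one list of paragraphs per page. Blank pages in the middle of the document are kept as empty lists so that the page numbering stays in step.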
3. There are dozens, perhaps hundreds, of commercial systems claiming to
extract material from PDFs into Word, preserving all the formatting.
Some of these are standalone programs you run yourself, some are web
sites or services, sometimes free, sometimes limited. I have never used
any of them.
4. There is a LOT of research going on about extraction from PDF.
Leading lights like Peter Murray-Rust have written programs which will
extract even tables from PDFs to SVG (not LaTeX, but an advance). All
part of the movement towards open publication and preventing publishers
from locking up material that they have no rights to; see
5. There is also some (less, I think, but I may just not have seen it)
work going on to extract mathematics direct from the positional
information in PDFs, but it is experimental, although there is a book
chapter about it (see the BibTeX entry below).
6. There have been reported successes, however, in using math OCR to
extract the equations from the printout. See
for using pdfocr and tesseract, which has some understanding of math. I
have used tesseract and it's great OCR, but I haven't tried it for maths.
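To give a flavour of what item 5 means by working from positional information, here is a toy Python sketch. It is not the method from the chapter cited below, just an illustration under my own assumptions: the Glyph record, the to_latex function and the 0.5pt tolerance are all made up, the glyphs are assumed to have already been pulled out of the PDF for a single line, and y is assumed to grow upward as in default PDF user space. It only spots simple superscripts and subscripts by comparing baselines and font sizes.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float         # left edge, in points
    baseline: float  # baseline height, in points (y grows upward)
    size: float      # font size, in points


def to_latex(glyphs, tolerance=0.5):
    """Turn one line of glyphs into a LaTeX-like string.

    Smaller glyphs whose baseline is raised or lowered relative to the
    main baseline are marked as superscripts or subscripts.
    """
    if not glyphs:
        return ""
    # Read the glyphs left to right.
    ordered = sorted(glyphs, key=lambda g: g.x)
    # Assumption: the largest glyph sits on the main baseline of the line.
    main = max(ordered, key=lambda g: g.size)
    parts = []
    for g in ordered:
        shift = g.baseline - main.baseline
        if g.size < main.size and shift > tolerance:
            parts.append("^{" + g.char + "}")
        elif g.size < main.size and shift < -tolerance:
            parts.append("_{" + g.char + "}")
        else:
            parts.append(g.char)
    return "".join(parts)


line = [
    Glyph("x", 10.0, 100.0, 10.0),
    Glyph("2", 16.0, 104.0, 7.0),
    Glyph("+", 22.0, 100.0, 10.0),
    Glyph("a", 30.0, 100.0, 10.0),
    Glyph("i", 36.0, 98.0, 7.0),
]
print(to_latex(line))  # prints x^{2}+a_{i}
```

Real mathematics needs far more than this (fractions, radicals, nested scripts, stretched delimiters), which is presumably why this kind of work is still experimental.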
author="Yu, Botao and Tian, Xuedong and Luo, Wenjie",
editor="Tan, Ying and Shi, Yuhui and Coello, Carlos A.",
title="Extracting Mathematical Components Directly from PDF Documents
for Mathematical Expression Recognition and Retrieval",
booktitle="Advances in Swarm Intelligence",
publisher="Springer International Publishing",
abstract="PDF document gains its popularity in information storage and
exchange. With more and more documents, especially the scientific
documents, available in PDF format, extracting mathematical expressions
in PDF documents becomes an important issue in the field of mathematical
expression recognition and retrieval. In this paper, we proposed a
method of extracting mathematical components directly from PDF documents
rather than cooperating indirectly with corresponding images converted
from PDF files. Compared with traditional image-based method, the
proposed method makes full use of the internal information of PDF
documents such as font size, baseline, glyph bounding box and so on to
extract the mathematical characters and their geometric information. The
experimental result shows the method could meet the needs of the
following processing of mathematical expressions such as formula
structural analysis, reconstruction and retrieval, and has a higher
efficiency than traditional image-based ways.",