*Post by timorrill*
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to
write it all out manually.

Nope. PDF is an end-of-the-line (final-form) format intended for screen
reading or printing. It does not contain any information about how the
characters or symbols got where they are, only about where they sit on
the page, how big they are, what colour, etc. All the information about
*why* stuff is where it is gets discarded once it has been used to do
the typesetting.
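
To make that concrete, here is a toy sketch of what a converter has to
work with: glyphs with positions and sizes, and nothing else. The
`Glyph` record, the sample coordinates and the 0.25 threshold are all
made up for illustration; real PDF data is messier, and this is not the
method of any particular tool. It only shows that a superscript has to
be *guessed* from geometry, because the "^" itself is gone.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Glyph:
    """Hypothetical record of what a PDF tells us about one character."""
    char: str
    x: float         # horizontal position in points
    baseline: float  # vertical position of the baseline in points (up is positive)
    size: float      # font size in points


def guess_latex(glyphs, shift_ratio=0.25):
    """Guess super/subscripts purely from position and size (a heuristic)."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    # Assume the largest glyph sits on the main baseline.
    main = max(ordered, key=lambda g: g.size)
    threshold = shift_ratio * main.size

    def role(g):
        shift = g.baseline - main.baseline
        if g.size < main.size and shift > threshold:
            return "^"   # smaller and raised: probably a superscript
        if g.size < main.size and shift < -threshold:
            return "_"   # smaller and lowered: probably a subscript
        return ""        # otherwise treat it as ordinary text

    pieces = []
    # Consecutive glyphs with the same role are emitted as one group.
    for r, group in groupby(ordered, key=role):
        text = "".join(g.char for g in group)
        pieces.append(f"{r}{{{text}}}" if r else text)
    return "".join(pieces)


# Invented sample data standing in for "x squared plus y sub i".
sample = [
    Glyph("x", 100.0, 500.0, 10.0),
    Glyph("2", 105.5, 503.6, 7.0),
    Glyph("+", 112.0, 500.0, 10.0),
    Glyph("y", 120.0, 500.0, 10.0),
    Glyph("i", 125.0, 497.0, 7.0),
]
print(guess_latex(sample))  # x^{2}+y_{i}
```

Even this trivial case depends on a threshold picked out of the air.
Fractions, roots, matrices and multi-line displays need far more
guesswork, which is why the answer to the original question starts
with "Nope".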

Ideally you need to go back to the author and ask them for a copy of the
original LaTeX document, but that isn't always possible.

HOWEVER...

1. It is possible to extract just the text with pdftotext (part of
Xpdf, see http://www.xpdfreader.com/). It comes out as plain text, one
line per paragraph, with a ^L (formfeed character) between pages.
Mathematics comes out as a jumble of unusable nonsense.

2. Apache PDFBox (https://pdfbox.apache.org/download.cgi) is a Java
.jar utility to extract text from PDFs, and if you pick HTML output it
will preserve bold and italic as well as paragraphs. I have no idea
what it would do with math; probably the same as [1].

I have used both of these and they are excellent for what they can do.

3. There are dozens, perhaps hundreds, of commercial systems claiming to
extract material from PDFs into Word, preserving all the formatting.
Some of these are standalone programs you run yourself, some are web
sites or services, sometimes free, sometimes limited. I have never used
any of them.

4. There is a LOT of research going on about extraction from PDF.
Leading lights like Peter Murray-Rust have written programs which will
extract even tables from PDFs to SVG (not LaTeX, but an advance). This
is all part of the movement towards open publication and towards
preventing publishers from locking up material that they have no rights
to; see https://pdfliberation.wordpress.com/

5. There is also some (less, I think, but I may just not have seen it)
work going on to extract mathematics direct from the positional
information in PDFs, but it is experimental, although there is a
conference paper about it, published in a book of proceedings.¹

6. There have been reported successes, however, in using math OCR to
extract the equations from the printout. See
https://tex.stackexchange.com/questions/266989/ocr-pdf-image-to-latex-math
for using pdfocr and tesseract, which has some understanding of math. I
have used tesseract and it's great OCR, but I haven't tried it for maths.
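
For option 1, a minimal Python sketch of using the formfeed to get the
text back page by page. It assumes pdftotext is installed and on your
PATH; the function names are my own invention. The "-" argument sends
the output to stdout and "-enc UTF-8" asks for UTF-8 text.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into pages at the formfeed (^L) characters."""
    pages = text.split("\f")
    # pdftotext normally ends the last page with a formfeed as well,
    # which leaves an empty chunk at the end; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def extract_pages(pdf_path):
    """Run pdftotext on pdf_path and return a list with one string per page."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return split_pages(result.stdout)
```

Calling extract_pages("paper.pdf") (a hypothetical filename) should give
one string per page. The prose will be usable; any mathematics in it
will still be the jumble described above.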

///Peter

--

¹ @InProceedings{10.1007/978-3-319-11897-0_20,
  author="Yu, Botao and Tian, Xuedong and Luo, Wenjie",
  editor="Tan, Ying and Shi, Yuhui and Coello, Carlos A.",
  title="Extracting Mathematical Components Directly from PDF Documents
    for Mathematical Expression Recognition and Retrieval",
  booktitle="Advances in Swarm Intelligence",
  year="2014",
  publisher="Springer International Publishing",
  address="Cham",
  pages="170--179",
  abstract="PDF document gains its popularity in information storage and
    exchange. With more and more documents, especially the scientific
    documents, available in PDF format, extracting mathematical expressions
    in PDF documents becomes an important issue in the field of mathematical
    expression recognition and retrieval. In this paper, we proposed a
    method of extracting mathematical components directly from PDF documents
    rather than cooperating indirectly with corresponding images converted
    from PDF files. Compared with traditional image-based method, the
    proposed method makes full use of the internal information of PDF
    documents such as font size, baseline, glyph bounding box and so on to
    extract the mathematical characters and their geometric information. The
    experimental result shows the method could meet the needs of the
    following processing of mathematical expressions such as formula
    structural analysis, reconstruction and retrieval, and has a higher
    efficiency than traditional image-based ways.",
  isbn="978-3-319-11897-0"
}