Convert PDF to .tex file?
timorrill
2008-06-03 14:10:40 UTC
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
Uwe Ziegenhagen
2008-06-03 14:16:14 UTC
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
Short: Impossible.

Long: I know of no tool that can do this.

Uwe
Uwe Ziegenhagen
2008-06-03 14:18:04 UTC
Post by Uwe Ziegenhagen
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
Short: Impossible.
Long: I know no tool which might be able to do this.
Uwe
BTW: If you do not need to modify the math, crop the pages with Acrobat
Pro (or whatever the name of the commercial license is now) or PDFTK
(maybe, never used it) and embed them as graphics.

Uwe
Rolf Niepraschk
2008-06-03 14:26:41 UTC
Uwe Ziegenhagen wrote:
...
Post by Uwe Ziegenhagen
BTW: If you do not need to modify the math, crop the pages with Acrobat
Pro (or whatever the name of the commercial license is now) or PDFTK
(maybe, never used it) and embed them as graphics.
Cropping is also possible with pdfLaTeX:

\includegraphics[viewport=u v w x,clip]{file}
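
A fuller sketch of the viewport approach, as a minimal compilable file. The file name source.pdf, the page number, and the coordinates are placeholders, not values from this thread. Note that viewport on its own only changes the box LaTeX reserves; the clip key is what actually hides the rest of the page.

```latex
\documentclass{article}
\usepackage{graphicx}
\begin{document}
% viewport = llx lly urx ury, in big points (bp) by default, measured
% from the lower-left corner of the included page.
% clip discards everything outside that rectangle.
% page=3 selects the third page of the (hypothetical) source.pdf.
\includegraphics[page=3, viewport=72 500 540 560, clip]{source.pdf}
\end{document}
```

The coordinates usually have to be found by trial and error, or by reading them off in a PDF viewer that shows the cursor position in points.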

...Rolf
A N Niel
2008-06-03 14:32:39 UTC
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
This is like: "Can I create the movie script from the finished film?"

Or: "Can I create the recipe from that meal they served me?"
Rolf Niepraschk
2008-06-03 14:36:11 UTC
Post by A N Niel
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
This is like: "Can I create the movie script from the finished film?"
Or: "Can I create the recipe from that meal they served me?"
Or: "Can I create apples from apple puree?"

...Rolf
Ted Pavlic
2008-06-03 20:48:47 UTC
Post by Rolf Niepraschk
Or: "Can I create apples from apple puree?"
...Rolf
I'm not sure it's that useful to consider this branch of the thread,
but...

Considering that the PDF may not have been created with TeX to begin
with, perhaps...

"Can I create apples from concentrated orange juice?"

or...

"Can I create a recipe from a shooting star?"

or...

"Can I create the movie script from the banana-flavored toothpaste?"
William F. Adams
2008-06-03 14:46:24 UTC
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
There are a couple of tools that attempt OCR including
mathematics, for example:

http://research.cs.queensu.ca/drl//ffes/

Convert the .pdf to a bitmap, then feed it to ffes.

William
c***@congster.de
2008-06-05 08:03:34 UTC
Post by William F. Adams
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
There're a couple of tools which attempt OCR which includes
mathematics, for example:
http://research.cs.queensu.ca/drl//ffes/
Convert the .pdf to a bitmap, then feed it to ffes.
William
It's actually unbelievable how well you can reconstruct the cow from
the hamburger:

http://www.inftyproject.org/en/software.html#InftyReader

Didn't test it, though.

Kurt
r***@rit.edu
2008-06-06 17:40:58 UTC
Post by William F. Adams
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
There're a couple of tools which attempt OCR which includes
mathematics, for example:
http://research.cs.queensu.ca/drl//ffes/
Convert the .pdf to a bitmap, then feed it to ffes.
Thought I should point out that FFES is a prototype for pen-based math
entry, and does not convert images directly to .tex at this time.
There is a preliminary, experimental part of the program for importing
images, but it's fairly weak at the present time. Also, for those
interested, there is a newer version of FFES available here:

http://www.cs.rit.edu/~rlaz/ffes/

I believe that the Infty system of Suzuki et al. does support
conversion from images to .tex, but have not had time to try the
system myself.

-Richard Zanibbi (member of the FFES development team, FFES maintainer)
Ted Pavlic
2008-06-06 19:56:28 UTC
Post by r***@rit.edu
http://www.cs.rit.edu/~rlaz/ffes/
-Richard Zanibbi (member of the FFES development team, FFES maintainer)
Slightly off topic -- if you try to install the distribution that's
online, it's going to fail when it tests the TXL compiler... From the
test_txl called from the Makefile for the DRACULAE_0.4 directory:

COMPILE_TEST=`cd test; txlc test/Test.Txl`

I think that "test/" should be removed. Additionally, in that DRACULAE
Makefile, I had to change the *.x rule to wrap the $< in a basename call.
That is, you're doing a "cd src" and then still using "src" in the path.

I'm running OS X 10.4. After making those changes, I was able to build
ffes fine.

--Ted
r***@rit.edu
2008-06-11 19:27:58 UTC
Post by Ted Pavlic
Post by r***@rit.edu
http://www.cs.rit.edu/~rlaz/ffes/
-Richard Zanibbi (member of the FFES development team, FFES maintainer)
Slightly off topic -- if you try to install the distribution that's
online, it's going to fail when it tests the TXL compiler... From the
test_txl called from the Makefile for the DRACULAE_0.4 directory:
COMPILE_TEST=`cd test; txlc test/Test.Txl`
I think that "test/" should be removed. Additionally, in that DRACULAE
Makefile, I had to change the *.x rule to wrap a $< by a basename.
That is, you're doing a "cd src" and then still using "src."
I'm running OS/X 10.4. After making those changes, I was able to build
ffes fine.
--Ted
Thank you for catching this. I will update these files when I get the
chance.

-Richard Zanibbi
Luite
2008-06-12 08:17:57 UTC
Post by c***@congster.de
There're a couple of tools which attempt OCR which includes
mathematics, for example:
http://research.cs.queensu.ca/drl//ffes/
It's actually unbelievable how well you can reconstruct the cow from
the hamburger:
Do you think we can put a copy of the cow into the hamburger?
What I mean is: can pdf(la)tex somehow put the original TeX code into
the PDF? I don't know what the PDF spec says about this, but I seem to
remember that PDFs can have embedded files (attachments). It would
increase the chances of the document being convertible to a new
standard in 30 or 100 years.

Cheerio, Luite.
Ken Starks
2008-06-12 10:00:22 UTC
Post by Luite
Post by c***@congster.de
There're a couple of tools which attempt OCR which includes
mathematics, for example:
http://research.cs.queensu.ca/drl//ffes/
It's actually unbelievable how well you can reconstruct the cow from
the hamburger:
Do you think we can put a copy of the cow into the hamburger?
What I mean is: can pdf(la)tex somehow put the original tex code into
the pdf? I don't know what the pdf specs say about this, but I seem to
remember that pdf's can have embedded files (attachments). It would
increase the chances of the document being convertable to a new
standard in 30 or 100 years.
cherio, Luite.
The most promising approach is likely to be the XML functionality
of PDF -- see the Adobe site and the `Mars' project for this.

Meanwhile, you can put anything you like into the pdf as
a comment (I DO mean comment, not comment-annotation).
PDF comments start with % and last until the end of the line.
William F. Adams
2008-06-12 12:45:56 UTC
Post by Luite
Post by c***@congster.de
There're a couple of tools which attempt OCR which includes
mathematics, for example:
http://research.cs.queensu.ca/drl//ffes/
It's actually unbelievable how well you can reconstruct the cow from
the hamburger:
Do you think we can put a copy of the cow into the hamburger?
What I mean is: can pdf(la)tex somehow put the original tex code into
the pdf? I don't know what the pdf specs say about this, but I seem to
remember that pdf's can have embedded files (attachments). It would
increase the chances of the document being convertable to a new
standard in 30 or 100 years.
That's not what he means, but yes, one can store a copy of the .tex
source (or any other file) within a .pdf when typesetting / creating it.

The Mac OS X Service app LaTeXiT.app (among others) does this, which
allows an embedded equation to be reverted to its source for
editing, then re-typesetting.

William
Heiko Oberdiek
2008-06-12 14:27:18 UTC
Post by Luite
Post by c***@congster.de
There're a couple of tools which attempt OCR which includes
mathematics, for example:
http://research.cs.queensu.ca/drl//ffes/
It's actually unbelievable how well you can reconstruct the cow from
the hamburger:
Do you think we can put a copy of the cow into the hamburger?
What I mean is: can pdf(la)tex somehow put the original tex code into
the pdf?
Easy, look at package embedfile or attachfile2 (or attachfile).
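
As a minimal sketch of the embedfile route, assuming the document is compiled with pdfLaTeX and that the main file carries the usual .tex extension (so that \jobname.tex names the source itself):

```latex
\documentclass{article}
\usepackage{embedfile}
\begin{document}
% Attach this document's own source file to the PDF being produced.
% \jobname expands to the main file name without its extension.
\embedfile{\jobname.tex}

Some text with math: $e^{i\pi} + 1 = 0$.
\end{document}
```

The attachment then appears in the attachments panel of viewers that support embedded files, without any visible mark on the page. Additional input files (figures, \input'ed chapters) would each need their own \embedfile line.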

Yours sincerely
Heiko <***@uni-freiburg.de>
Ted Pavlic
2008-06-12 15:26:13 UTC
Post by Heiko Oberdiek
Post by Luite
Do you think we can put a copy of the cow into the hamburger?
What I mean is: can pdf(la)tex somehow put the original tex code into
the pdf?
Easy, look at package embedfile or attachfile2 (or attachfile).
I assume that these packages require the use of pdftex. That is, they
require generating a PDF directly from TeX, which may not be appealing
for many users (including this one).

Is there a way to embed the TeX into a DVI and then still manage to
maintain it through the dvips and ps2pdf pipeline? (I assume not)

--Ted
Peter Flynn
2008-06-03 20:55:59 UTC
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document.
This is like asking to recreate the whole cow from a hamburger.
Post by timorrill
Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
Find the original source and use that. Reverse-engineering may be
possible, but it will take longer than retyping it.

///Peter
Bob Tennent
2008-06-03 21:59:56 UTC
Post by Peter Flynn
This is like asking to recreate the whole cow from a hamburger.
Enough of this.

The fact is that Adobe Acrobat can often create a usable .doc from a
PDF, though this likely works well only with ordinary text documents.
It's unfortunate a comparable free application doesn't exist.

Bob T.
David Kastrup
2008-06-03 22:22:01 UTC
Post by Bob Tennent
Post by Peter Flynn
This is like asking to recreate the whole cow from a hamburger.
Enough of this.
The fact is that Adobe Acrobat can often create a usable .doc from a
PDF, though this likely works well only with ordinary text documents.
It's unfortunate a comparable free application doesn't exist.
Ah, but this depends on what one calls "usable". Usable means the
consistent use of style sheets, cross references and stuff like that.
That 95% of WYSIWYG system users will go "Huh? What's that?" does not
change that you don't want a 1000-page document without such basic
elements in it.

Regardless of whether it has been produced by Acrobat, a clueless retyper,
a clueless original typer or a free tool.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
UKTUG FAQ: <URL:http://www.tex.ac.uk/cgi-bin/texfaq2html>
Ista Zahn
2008-06-03 22:37:21 UTC
Post by Bob Tennent
Post by Peter Flynn
This is like asking to recreate the whole cow from a hamburger.
Enough of this.
The fact is that Adobe Acrobat can often create a usable .doc from a
PDF, though this likely works well only with ordinary text documents.
It's unfortunate a comparable free application doesn't exist.
Bob T.
In fact you can convert from PDF to .doc using free tools. If you are on
Linux, KWord can import PDF and export to formats that MS Word can read.
Or you can use pdftohtml and then convert the HTML to .doc. Or you can
sign up for Gmail, email the PDF to yourself, and have
Google convert it to HTML (and then convert the HTML to .doc format).
None of these methods will do what the OP wanted, of course (convert math
in a PDF to LaTeX), but then again neither will Adobe Acrobat...
Bob Tennent
2008-06-04 10:25:44 UTC
Post by Ista Zahn
Post by Bob Tennent
Post by Peter Flynn
This is like asking to recreate the whole cow from a hamburger.
Enough of this.
The fact is that Adobe Acrobat can often create a usable .doc from a
PDF, though this likely works well only with ordinary text documents.
It's unfortunate a comparable free application doesn't exist.
In fact you can convert from pdf to .doc using free tools.
What I meant by comparable was to convert .pdf to .tex. I'm aware it is
possible to go from .pdf to .doc and then .doc to .tex using Abiword,
but surely we could and should do better.

My main point was that it is inappropriate to use irrelevant analogies
to mock the OP's request.

Bob T.
Robin Fairbairns
2008-06-04 13:25:54 UTC
Post by Bob Tennent
Post by Ista Zahn
Post by Bob Tennent
Post by Peter Flynn
This is like asking to recreate the whole cow from a hamburger.
Enough of this.
The fact is that Adobe Acrobat can often create a usable .doc from a
PDF, though this likely works well only with ordinary text documents.
It's unfortunate a comparable free application doesn't exist.
In fact you can convert from pdf to .doc using free tools.
What I meant by comparable was to convert .pdf to .tex. I'm aware it is
possible to go from .pdf to .doc and then .doc to .tex using Abiword,
but surely we could and should do better.
My main point was that it is inappropriate to use irrelevant analogies
to mock the OP's request.
there is a faq answer that says (in effect) that there's no point in
even trying anything beyond extracting the text. this thread is the
first time anyone's mentioned anything else ... rescanning printed
output sounds (ahem) "fun".

anyway, i shall revise the answer some time.
--
Robin Fairbairns, Cambridge
Wilfried Hennings
2008-06-04 18:06:56 UTC
Post by Robin Fairbairns
Post by Bob Tennent
What I meant by comparable was to convert .pdf to .tex. I'm aware it is
possible to go from .pdf to .doc and then .doc to .tex using Abiword,
but surely we could and should do better.
there is a faq answer that says (in effect) that there's no point in
even trying anything beyond extracting the text. this thread is the
first time anyone's mentioned anything else ... rescanning printed
output sounds (ahem) "fun".
There is no need to "rescan printed output".
Modern OCR software (commercial: Caere OmniPage, Abbyy FineReader) can
directly read pdf, convert it to a bitmap and OCR this bitmap. Of
course the quality is better than with printing and rescanning.
And if you want to do it manually, you can open the pdf with
Ghostscript and convert it to a bitmap, then apply the OCR of your
choice.

This OCR software can also guess formatting (not perfect, but
usable).
Drawback: It saves in MS Word format, not (La)TeX.


Wilfried Hennings
please reply in the newsgroup
Luis Rivera
2008-06-04 18:36:36 UTC
Post by Robin Fairbairns
Post by Bob Tennent
Post by Bob Tennent
Post by Peter Flynn
This is like asking to recreate the whole cow from a hamburger.
Enough of this.
My main point was that it is inappropriate to use irrelevant analogies
to mock the OP's request.
Just one more analogy!!! (I hope it is relevant): I always thought it
was funny to think of TeX as an animal (a worm: a book worm if you
like it ;-), with eyes, mouth, gullet, stomach, and guts: in the end,
you get TeX's poop (in whatever format) and deliver it on paper,
plastic, or screen. Gross...

So, in the end, the OP is asking something like trying to make a
burger from... er... you know... Of course you can, but it will take
time and effort (make fertilizer, grow hay, raise the cow, and make
the burger; perhaps even coding the cow, so to speak); and probably
won't work anyway (if the PDF was made by Equation Editor or whatever,
and equations are displayed precisely as images).
Post by Robin Fairbairns
there is a faq answer that says (in effect) that there's no point in
even trying anything beyond extracting the text. this thread is the
first time anyone's mentioned anything else ... rescanning printed
output sounds (ahem) "fun".
The problem is that these formats (DVI, PS, PDF) are actually suited
for visual display, and going back from them inevitably drops some
information. The easiest thing to do, as suggested, is to crop the
displayed math as images (at a suitable resolution), and try to
recover the text, by whatever methods available. You may retype the
equations later, one by one, by hand. Other textual markup (ODF, XML,
MathML, or whatever) could be easier to recover.

I hope I haven't spoiled anybody's dinner :o)

Louie.
d***@gmail.com
2018-07-19 12:32:44 UTC
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
Robert Heller
2018-07-19 15:29:15 UTC
It will likely not be possible to recover the original LaTeX code. It might
be possible to extract the text, *as printed*. How useful that will be in
recreating the LaTeX code is uncertain.
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.
--
Robert Heller -- 978-544-6933
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
***@deepsoft.com -- Webhosting Services
Peter Flynn
2018-07-19 22:03:16 UTC
Post by timorrill
I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to
write it all out manually.
Nope. PDF is an end-of-line format intended for screen reading or
printing. It does not contain any information about how the characters
or symbols got where they are, only about the position they are in on
the page, how big, what colour, etc. All the information about *why*
stuff is where it is, is omitted once it has been used to do the
typesetting.

Ideally you need to go back to the author and ask them for a copy of the
original LaTeX document, but that isn't always possible.

HOWEVER...

1. It is possible to extract just the text (pdftotext is part of Xpdf,
see http://www.xpdfreader.com/) which comes out as plain text, one line
per paragraph, with a ^L (formfeed character) between pages. Mathematics
comes out as a jumble of unusable nonsense.

2. Apache PDFBox is a Java .jar utility to extract text from PDFs
https://pdfbox.apache.org/download.cgi and if you pick HTML output it
will preserve bold and italic as well as paragraphs. I have no idea what
it would do with math, probably the same as [1].

I have used both of these and they are excellent for what they can do.

3. There are dozens, perhaps hundreds, of commercial systems claiming to
extract material from PDFs into Word, preserving all the formatting.
Some of these are standalone programs you run yourself, some are web
sites or services, sometimes free, sometimes limited. I have never used
any of them.

4. There is a LOT of research going on about extraction from PDF.
Leading lights like Peter Murray-Rust have written programs which will
extract even tables from PDFs to SVG (not LaTeX, but an advance). All
part of the movement towards open publication and preventing publishers
from locking up material that they have no rights to; see
https://pdfliberation.wordpress.com/

5. There is also some (less, I think, but I may just not have seen it)
work going on to extract mathematics direct from the positional
information in PDFs, but it is experimental, although there is a book
about it.¹

6. There have been reported successes, however, in using math OCR to
extract the equations from the printout. See
https://tex.stackexchange.com/questions/266989/ocr-pdf-image-to-latex-math
for using pdfocr and tesseract, which has some understanding of math. I
have used tesseract and it's great OCR, but I haven't tried it for maths.

///Peter
--
¹ @InProceedings{10.1007/978-3-319-11897-0_20,
author="Yu, Botao and Tian, Xuedong and Luo, Wenjie",
editor="Tan, Ying and Shi, Yuhui and Coello, Carlos A.",
title="Extracting Mathematical Components Directly from PDF Documents
for Mathematical Expression Recognition and Retrieval",
booktitle="Advances in Swarm Intelligence",
year="2014",
publisher="Springer International Publishing",
address="Cham",
pages="170--179",
abstract="PDF document gains its popularity in information storage and
exchange. With more and more documents, especially the scientific
documents, available in PDF format, extracting mathematical expressions
in PDF documents becomes an important issue in the field of mathematical
expression recognition and retrieval. In this paper, we proposed a
method of extracting mathematical components directly from PDF documents
rather than cooperating indirectly with corresponding images converted
from PDF files. Compared with traditional image-based method, the
proposed method makes full use of the internal information of PDF
documents such as font size, baseline, glyph bounding box and so on to
extract the mathematical characters and their geometric information. The
experimental result shows the method could meet the needs of the
following processing of mathematical expressions such as formula
structural analysis, reconstruction and retrieval, and has a higher
efficiency than traditional image-based ways.",
isbn="978-3-319-11897-0"
}
Martin Vaeth
2018-07-20 06:25:38 UTC
Post by Peter Flynn
6. There have been reported successes, however, in using math OCR to
extract the equations from the printout.
If an OCR program can do it with some success, it should be much
simpler to do it from the PDF directly. However, AFAIK nobody has done it yet,
and it is quite demanding (and hard to make error-free in some corner cases;
even humans sometimes make errors there if they cannot infer it from the
content).
One might want to inspect the corresponding part of the OCR program
or do some experiments with machine learning. Perhaps a semester
project for an interested student.
Peter Flynn
2018-07-20 18:48:25 UTC
Post by Martin Vaeth
Post by Peter Flynn
6. There have been reported successes, however, in using math OCR to
extract the equations from the printout.
If an OCR program can do it with quite success, it should be even much
simpler to do it from PDF directly. However, AFAIK nobody has done it yet,
and it is quite demanding (and hard to make error-free in some corner cases;
even humans sometimes make errors there if they cannot conclude it from the
content).
One might want to inspect the corresponding part of the OCR program
or do some experiments with machine learning. Perhaps a semester
project for an interested student.
I suspect that it's easier to write using the OCR program because that
looks at the scanned bitmap and they already have robust routines to do
character-recognition and positional analysis. In a PDF, there's no
bitmap, so you have to work with the (x,y) coordinates (or deduce them),
although you do at least get handed the character identity. The authors
of the book I cited claim to have done it from a PDF. Definitely a case of
more research needed.

///Peter
Axel Berger
2018-07-20 20:23:04 UTC
using the OCR program because that looks at the scanned bitmap
I've no idea what it does internally but I can feed my OCR (Abbyy
Version 5) a non bitmap PDF and get a result.
--
/¯\ No | Dipl.-Ing. F. Axel Berger Tel: +49/ 221/ 7771 8067
\ / HTML | Roald-Amundsen-Straße 2a Fax: +49/ 221/ 7771 8069
 X in | D-50829 Köln-Ossendorf http://berger-odenthal.de
/ \ Mail | -- No unannounced, large, binary attachments, please! --
Axel Berger
2018-07-20 07:11:37 UTC
It is possible to extract just the text which comes out as plain text
As you correctly say, there is no text in the PDF, only the placement of
individual letters. Assembly of text is done by heuristics and can go
wrong. The most common error is failing to recognize inter-word spaces,
especially in front of "w"s.

Secondly, PDF does not even know letters, only funny graphic shapes. They
can be listed internally in the order of the letters they represent, but
they need not be. Sometimes all you get is some kind of gobbledegook. If
you enjoy code breaking it can be read, but it's a lot of work.

That said, mostly the results from pdftotext are just fine, and sometimes
the -raw or -layout options help if the standard output has issues.

As PDF has no problem with faint printing or smudged scans, it is also
the ideal source for a good OCR.
--
/¯\ No | Dipl.-Ing. F. Axel Berger Tel: +49/ 221/ 7771 8067
\ / HTML | Roald-Amundsen-Straße 2a Fax: +49/ 221/ 7771 8069
 X in | D-50829 Köln-Ossendorf http://berger-odenthal.de
/ \ Mail | -- No unannounced, large, binary attachments, please! --