OpenOffice.org Forum at OOoForum.org

convert PDF to OOo formats?
DannyB
Moderator
Joined: 02 Apr 2003
Posts: 3991
Location: Lawrence, Kansas, USA

Posted: Wed Jun 16, 2004 7:11 am    Post subject: Re: pdf = vector?

Anonymous wrote:
It seems to me that it is a little misleading to describe PDF as a vector format, especially since it can contain raster images. A better way to describe it would be as a document format. That's nice and vague. :)


In most vector formats I've ever met (e.g. OOo Draw, MacDraw, QuickDraw PICT, Windows Metafile, PostScript), a raster image is a single "operation" of the vector format. That is, a page may contain any combination of lines, text, bitmaps, ellipses, polygons, etc. The fact that a document can be made of pages, where each page contains only a single very large raster image, does not make it any less a vector format.

For example, I could make an OOo Draw document where each page is just a large bitmap shape. Save it as a Draw document. The document is still fundamentally a vector format even though the bulk of data in this particular example is made up of raster pixels.

In no way do I want to mislead. I'll say it this way: from everything I've read about the pdf format, it seems to me to be a vector format which holds shapes, lines, text, raster images, etc.


Quote:
What I think would be of most value, as far as a PDF import filter goes, would be to create the filter such that it parses the PDF, identifies raster components and then converts said raster components into a vector format. Then with every component in a vector format apply pattern recognition to ascertain what the equivalent characters, font and whatnot are.


Converting raster to vector is quite a trick, and open to much subjective interpretation. One approach is to take a raster and "posterize" it somewhat. (See GIMP or Photoshop for the posterize operation.) Then turn large blotches of a single color into, say, poly-bezier-gons.
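To make the posterize step concrete, here is a minimal sketch in plain Python. It assumes the image is simply a list of rows of (r, g, b) tuples with 0-255 channels; the function names are made up for illustration and are not from GIMP or Photoshop. It only shows the color-reduction half of the idea. Tracing the resulting blotches into bezier polygons is the hard part and is not attempted here.

```python
def posterize_channel(value, levels):
    """Snap one 0-255 channel value to one of `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    bucket = value * levels // 256        # which of the `levels` bands the value falls in
    return bucket * 255 // (levels - 1)   # spread the bands back over 0..255


def posterize(image, levels):
    """Posterize an image given as rows of (r, g, b) tuples (assumed layout)."""
    return [
        [tuple(posterize_channel(c, levels) for c in pixel) for pixel in row]
        for row in image
    ]


# A tiny 2x3 "image" whose pixels are all slightly different shades.
image = [
    [(250, 10, 12), (245, 5, 20), (30, 200, 90)],
    [(252, 0, 3), (20, 210, 100), (25, 205, 95)],
]
flat = posterize(image, 4)

# Count distinct colors before and after; fewer colors means larger uniform blotches.
colors_before = {pixel for row in image for pixel in row}
colors_after = {pixel for row in flat for pixel in row}
```

After posterizing, nearby shades collapse into the same color, which is what makes "large blotches" available for a later tracing step.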

OCRing images into text is a separate but related problem.

If I ever, ever, get around to trying to build a PDF import filter, which is a huge potential project (but a fun one), I would not even try to OCR stuff. I would just bring in each raster (even if it is full page) and place it faithfully on the Draw page where it goes.


Quote:
At least this is the behavior I would expect for an import filter into Writer. For Draw I would expect that all the components would remain in their original format wherever possible though.


I think you've just hit on why I assert that Draw is the natural import target for PDF, not Writer.

Doing an additional complex OCR operation to turn some pixels into text would be a very cool additional feature, even if it were done only in Draw. It may then be possible to place Frames in Writer and essentially "lay out" the page into a Writer document. There would be the problem of recognizing whether the text happens to fit naturally into the structure of Writer. That is, the main body is paragraphs, perhaps divided into columns, and frames, and text flowing around shapes. Bringing PDF into Writer almost requires PDF->Draw as a prerequisite. Then add the layers of complexity to attempt to properly format it back into Writer.

I understand why people want PDF -> Writer. And it might be possible for a subset of PDF documents. But all PDFs can be faithfully reproduced in Draw, and then turned right back around into PDF again with little or no loss of fidelity.


Quote:
On the whole, I question the real value of a PDF import filter, but clearly it's desirable since this is not the first thread on the subject. Open source raster to vector tools already exist, as do character recognition tools. So it's really a matter of pulling it all together into a reasonable implementation.


If there were a Python pdf parser that parsed a pdf into a large tree of objects, this would be a huge first step. If you read the pdf document specs, you can see that a pdf is made up of tokens. The tokens form some basic data types, such as long, floating point, string, symbol, and dictionary. A number of more complex types (such as a font) are made up out of dictionaries of dictionaries.
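As a rough illustration of the "tokens into a tree of objects" idea, here is a minimal sketch in Python. It handles only a small subset of PDF object syntax (integers, reals, names, simple parenthesized strings, arrays, dictionaries, and the keywords true/false/null). Streams, hex strings, string escapes, comments, and indirect references are deliberately left out, and the function names are invented for this example, not taken from any existing library.

```python
import re

# One alternative per token kind; "real" must come before "int" so 278.5 is not split.
TOKEN_RE = re.compile(
    r"""
    (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<arr_open>\[)  | (?P<arr_close>\])
  | (?P<name>/[^\s/\[\]<>()]*)
  | (?P<string>\([^()]*\))
  | (?P<real>[+-]?(?:\d+\.\d*|\.\d+))
  | (?P<int>[+-]?\d+)
  | (?P<keyword>[A-Za-z]+)
    """,
    re.VERBOSE,
)

# Private sentinels returned when a closing delimiter is read.
_ARR_END, _DICT_END = object(), object()


def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace between tokens."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        yield match.lastgroup, match.group()
        pos = match.end()


def _collect(tokens, end):
    """Parse objects until the given closing sentinel is seen."""
    items = []
    while True:
        item = parse(tokens)
        if item is end:
            return items
        items.append(item)


def parse(tokens):
    """Parse one object from an iterator of (kind, text) tokens."""
    kind, text = next(tokens)
    if kind == "int":
        return int(text)
    if kind == "real":
        return float(text)
    if kind == "name":
        return text                      # keep the leading slash to tell names from strings
    if kind == "string":
        return text[1:-1]                # drop the surrounding parentheses
    if kind == "keyword":
        return {"true": True, "false": False, "null": None}.get(text, text)
    if kind == "arr_close":
        return _ARR_END
    if kind == "dict_close":
        return _DICT_END
    if kind == "arr_open":
        return _collect(tokens, _ARR_END)
    # Remaining kind is dict_open: keys and values alternate until ">>".
    flat = _collect(tokens, _DICT_END)
    return dict(zip(flat[0::2], flat[1::2]))


# A font-like object: a dictionary containing another dictionary.
source = "<< /Type /Font /BaseFont /Helvetica /Widths [ 278 278.5 ] /Encoding << /Type /Encoding >> >>"
font = parse(tokenize(source))
```

Here `font["/Encoding"]` is itself a dictionary, which is the "dictionaries of dictionaries" shape described above. A real parser would also need the cross-reference table and stream handling before it could read an actual file.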

At this time, I'm still working past the baby steps of building components in Python. That is, I'm still focused on many of the "plumbing" issues. Custom interfaces. How to load libraries in Python (Cybb20, if you're reading this, I have a theory on how to do this). Multiple services in a single component. And many other basic issues to solve. Right now I'm working on the library loader problem because I have enough code that it is not practical to put it all into one huge text file, as OOo seems to want you to do.

As for value, I would love to have a PDF import, even if it were just to Draw. I would love a (general purpose) PDF to Writer, but am not convinced that as many PDFs would work (well) with it. Writer does have a "draw page". If I ever got this far, I could place all the shapes into Writer as floating frames, shapes, bitmaps, etc. What you would end up with would indeed be a Writer doc, but the main text body, headers, footers, etc. would be empty!
Graaf
General User
Joined: 26 Aug 2004
Posts: 33
Location: Finland

Posted: Tue Aug 31, 2004 2:14 pm

Phew! Quite a thread. I think I can shed some light on this conversation. I'm a graphic designer and I've been using vector-based drawing (Freehand, Illustrator) and publishing (QuarkXPress, InDesign) software for years.

PDF Format
It seems to me that there is a very confused picture of pdf's nature here. DannyB is about the only one who really understands what pdf is all about:

The format is indeed quite “nice and vague”, but essentially vector based. Almost all vector formats support exactly the same set of features (text, bezier curves, bitmaps, etc.) as pdf. There is no format that is “pure vector”, i.e. only bezier curves. Even if there were, it wouldn't be very useful in practice.

I think it's best to think of vector formats as containers which can embed (or in some cases, link) many different kinds of data. Exactly as DannyB said.

PDF vs. SVG and other formats
While originally designed as a rather static format, simply to provide a standardized means of transferring digital material to printing houses, pdf has soared in popularity because it was (and still is) the only truly interchangeable format available. The need to be compatible with printing houses, which adopted pdf eagerly, made it necessary to add support for it to vector-based programs. Subsequently, it has become one of the most “standard” formats available in these programs. The fact that Adobe made the format "open" is the key. All other formats are more or less closed and proprietary (many flavors of .ai, .cdr, .fh9, .fh10, .fh11, etc.). .pdf is the .doc of the graphics industry in terms of acceptance.

Pdf is not the best possible format as it still bears the burdens of its original design: there are still things that pdf does not support or does so poorly. It is however about the best we have.

I understand your enthusiasm about SVG and agree that it has much promise. But how many reputable vector/publishing programs currently support it? I mean the software used by professionals. One, namely Adobe Illustrator (import only!). Is it going to gain much acceptance in that field? Pdf is so popular (and backed by the gigantic Adobe) that I seriously doubt it. Besides, users of vector programs are so frustrated with all those different formats that they won't want yet another one.

Importing pdf to Draw
What is said above makes this really, really important. It's so important as to make it almost mandatory. Postscript and encapsulated postscript (ps and eps) can be converted to pdf quite easily with existing tools (e.g. GSview), and any vector format can be converted to postscript (by printing to file). This means that if Draw could support pdf, it could indirectly support just about any vector format available!
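The postscript to pdf step can also be scripted. GSview is a front end to Ghostscript, so the sketch below, in Python, builds a Ghostscript command line using its pdfwrite output device. The executable name "gs" is an assumption (on Windows it is usually named differently), the file names are placeholders, and the helper names are invented for this example.

```python
import subprocess


def ps_to_pdf_command(ps_path, pdf_path, gs="gs"):
    """Build a Ghostscript argument list that writes ps_path out as a pdf."""
    return [
        gs,                            # assumed executable name; adjust per platform
        "-dBATCH",                     # exit after processing the input
        "-dNOPAUSE",                   # do not wait for a keypress between pages
        "-sDEVICE=pdfwrite",           # select the pdf output device
        f"-sOutputFile={pdf_path}",
        ps_path,
    ]


def ps_to_pdf(ps_path, pdf_path, gs="gs"):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ps_to_pdf_command(ps_path, pdf_path, gs), check=True)


# Placeholder file names, only to show the resulting command.
command = ps_to_pdf_command("drawing.eps", "drawing.pdf")
```

Building the argument list separately from running it makes it easy to inspect or log the command before anything is executed.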

As DannyB pointed out, all objects in a pdf file have a natural counterpart in Draw, and this makes Draw the most natural place to import it.

Importing pdf to Writer
You can't import pdf to Writer and expect pretty, nicely formatted pages even if there were a converter to do the job. Writer's formatting capabilities are not meant to handle floating textboxes, transparency, clipping paths, blends, skewed and rotated objects and such. Even DTP programs, while far more graphically oriented than word processors, aren't capable of doing such things.

I agree that there could be a converter nonetheless – a nice little one which extracts the text but tries to do almost nothing more.

From Writer to pdf it's essentially a one-way trip. From Draw it could be two-way.


Last edited by Graaf on Tue Aug 31, 2004 4:34 pm; edited 1 time in total
xirontask
Power User
Joined: 25 Aug 2004
Posts: 56

Posted: Tue Aug 31, 2004 3:48 pm    Post subject: convert PDF to OOo formats?

On this site
http://site4.pdf995.com/download.html
I read:
"The Pdf995 Suite offers the following features, all at no cost:
• Convert PDF to JPEG, TIFF, BMP, PCX formats
• Convert PDF to text
• Convert PDF to HTML and Word DOC conversion"

I have not downloaded this (gratis) program, but maybe it can help?
_________________
Xirontask
Graaf
General User
Joined: 26 Aug 2004
Posts: 33
Location: Finland

Posted: Tue Aug 31, 2004 4:10 pm

JPEG, TIFF, BMP and PCX are bitmap formats, which means that you not only lose the editing capabilities, but you lose the resolution independence as well. They are useless here.

pdf995's PDF to HTML and Word conversion could be used when fashioning Writer's pdf importer if it were open source. Unfortunately it's not. Besides, it supports only Windows.

I would turn to pstoedit when searching for a solution for PDF -> Draw. It looks very promising.
anvilsoup
Super User
Joined: 09 Feb 2003
Posts: 606
Location: Australia, mate!

Posted: Wed Dec 22, 2004 3:31 am

Actually, if you want to print out a small portion of a PDF, all you need to do is fire up Acrobat Reader 5 or 6 (or 7, which is apparently coming out soon), select the area you want to print using the area select tool (on the toolbar, within the text select icon (hold)?), then choose File > Print > Selected Area, then Fit to Page, and voila!
shengchieh
General User
Joined: 13 Dec 2004
Posts: 11

Posted: Wed Dec 22, 2004 1:13 pm

You people might be interested in

http://wheel.compose.cs.cmu.edu:8001/cgi-bin/browse/objweb

Perhaps converting pdf to MS Word and then reading it into OOo might be possible (although I think the formats are too different).

Sheng-Chieh
Chejkal
Newbie
Joined: 24 Feb 2006
Posts: 2
Location: st louis,mo

Posted: Mon May 01, 2006 10:19 pm    Post subject: What about conversion from printer format?

Offhand, I forget the extension of the files sent to the printer. However, IF we could convert from that format, there is a freeware program, 'pdf creator' (I think there are duplicates of that name, so check to be sure it IS the freeware version), that will take ANYthing going to the printer and convert it to PDF format, and it seems to work fairly well too (so far).

chuck
9point9
Moderator
Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Mon May 01, 2006 11:26 pm

Yes, there is PDF Creator but that is for going to PDF and we can already do that in OOo. The question was about importing PDF, something that nothing can really do as PDF is for document description, not text.
_________________
Arch Linux
OOo 3.2.0

OOoSVN, change control for OOo documents:
http://sourceforge.net/projects/ooosvn/
earonesty
Newbie
Joined: 30 Jan 2008
Posts: 1

Posted: Wed Jan 30, 2008 9:07 am    Post subject: Re: depends...

DannyB wrote:
Anonymous wrote:
If the pdf file contains text and embedded pictures, the pdf import plugin for koffice under Linux does a good job. For plain text it does anyway;-)

You'd end up with a koffice file that you can export as an RTF file, which you could open in OpenOffice. All embedded pictures are saved as png files if I remember right.


It sounds like KOffice imports a PDF as a "word processing" document not as a "drawing" document. I believe that OOo's Draw and NOT Writer is the correct destination for an imported PDF.


That depends on the document. In my daily experience, most PDFs I handle are formatted as "text with embedded images" - and were originally word-processing documents.

Although Draw may be technically correct, there is most probably more demand for, and usefulness to be derived from, a Writer import. Again, this is assuming that most PDF documents that people want to import have their origin in a word processor, and have their editing best done by Writer. Probably the folks at koffice already came to this conclusion - which is why they import as "Writer" now.

In an ideal scenario, both Writer and Draw would be able to "open" PDF files, resulting in a "best attempt" import for each - allowing the user to select the tool deemed appropriate.

IMHO - any attempt (even just importing straight text) would be better than the current behavior of just importing the file as binary.
huwg
Super User
Joined: 14 Feb 2007
Posts: 890

Posted: Thu Jan 31, 2008 1:18 am    Post subject: Re: depends...

earonesty wrote:
In an ideal scenario, both Writer and Draw would be able to "open" PDF files, resulting in a "best attempt" import for each - allowing the user to select the tool deemed appropriate.

That looks like the way it is being built:
http://wiki.services.openoffice.org/wiki/Pdf_Import_Extension#PDF_Import_Options_Dialog
AndrewZ
Moderator
Joined: 21 Jun 2004
Posts: 4140
Location: Colorado, USA

Posted: Wed Jun 04, 2008 5:11 pm

It's ready now. Please try the new experimental release of PDF Import for OpenOffice.org 3.0.
_________________
<signature>
* Did you solve your problem? Do others a favor: Post the solution
* OpenOffice.org Ninja
* BleachBit
</signature>
Questular
Newbie
Joined: 02 Oct 2008
Posts: 1
Location: Belgium

Posted: Thu Oct 02, 2008 2:16 am

I tried PDF to PDF/A conversion with OOo3 and the Sun pdf-import extension.
Command-line: 6 per minute. Great!
All times are GMT - 8 Hours