DannyB Moderator


Joined: 02 Apr 2003 Posts: 3991 Location: Lawrence, Kansas, USA
|
Posted: Wed Jun 16, 2004 7:11 am Post subject: Re: pdf = vector? |
|
|
| Anonymous wrote: | It seems to me that it is a little misleading to describe PDF as a vector format. Especially since it can contain raster images. A better way to describe it would be as a document format. that's nice and vague.  |
In most vector formats I've ever met (e.g. OOo Draw, MacDraw, QuickDraw PICT, Windows Metafile, PostScript), a raster image is a single "operation" of the vector format. That is, a page may contain any combination of lines, text, bitmaps, ellipses, polygons, etc. The fact that a document can be made of pages, each containing only a single very large raster image, does not make it any less a vector format.
For example, I could make an OOo Draw document where each page is just a large bitmap shape. Save it as a Draw document. The document is still fundamentally a vector format even though the bulk of data in this particular example is made up of raster pixels.
In no way do I want to mislead, so I'll say it this way: from everything I've read about the pdf format, it seems to me to be a vector format which holds shapes, lines, text, raster images, etc.
| Quote: | | What I think would be of most value, as far as a PDF import filter goes, would be to create the filter such that it parses the PDF, identifies raster components and then converts said raster components into a vector format. Then with every component in a vector format apply pattern recognition to ascertain what the equivalent characters, font and whatnot are. |
Converting raster to vector is quite a trick, and the result depends heavily on subjective interpretation. One approach is to take a raster and "posterize" it somewhat. (See GIMP or Photoshop for the posterize operation.) Then turn large blotches of a single color into, say, poly-bezier-gons.
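A rough sketch of the "posterize, then find the blotches" idea in pure Python. It assumes a grayscale image stored as a list of rows of 0-255 integers. The names `posterize` and `find_regions` are made up for illustration, and the hard part (tracing each region's outline into bezier curves) is not attempted here.

```python
def posterize(pixels, levels=4):
    """Quantize 0-255 grayscale values down to `levels` evenly spaced values."""
    step = 256 // levels
    return [[min(v // step, levels - 1) * step for v in row] for row in pixels]


def find_regions(image):
    """Group 4-connected pixels of equal value.

    Returns a list of (value, set_of_(x, y)_cells) in scan order.
    """
    height, width = len(image), len(image[0])
    seen = set()
    regions = []
    for y in range(height):
        for x in range(width):
            if (x, y) in seen:
                continue
            value = image[y][x]
            cells = set()
            stack = [(x, y)]
            seen.add((x, y))
            # Iterative flood fill over same-valued neighbours.
            while stack:
                cx, cy = stack.pop()
                cells.add((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and (nx, ny) not in seen
                            and image[ny][nx] == value):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
            regions.append((value, cells))
    return regions


# Hypothetical 4x3 grayscale sample: dark left half, bright top right,
# mid-gray bottom right.
gray = [
    [10, 20, 200, 210],
    [15, 30, 220, 250],
    [12, 25, 90, 100],
]
flat = posterize(gray, levels=4)
regions = find_regions(flat)
```

Each entry in `regions` is one single-color blotch. A real tracer would then walk the boundary of each blotch and fit curves to it, which is where the subjective choices come in.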
OCRing images into text is a separate but related problem.
If I ever, ever get around to trying to build a PDF import filter, which is potentially a huge project (but a fun one), I would not even try to OCR stuff. I would just bring in each raster faithfully (even if it is a full page) and place it on the Draw page where it goes.
| Quote: | | At least this is the behavior I would expect for an import filter into Writer. For Draw I would expect that all the components would remain in their original format wherever possible though. |
I think you've just hit on why I assert that Draw is the natural import target for PDF, not Writer.
Doing an additional complex OCR operation to turn some pixels into text would be a very cool additional feature, even if it were done only in Draw. It might then be possible to place frames in Writer and essentially "lay out" the page as a Writer document. There would still be the problem of recognizing how the text naturally fits into the structure of Writer: a main body of paragraphs, perhaps divided into columns, plus frames and text flowing around shapes. Bringing PDF into Writer almost requires PDF->Draw as a prerequisite. Then add the layers of complexity to attempt to properly format it back into Writer.
I understand why people want PDF -> Writer. And it might be possible for a subset of PDF documents. But all PDFs can be faithfully reproduced in Draw, and then turned right back around into PDF again with little or no loss of fidelity.
| Quote: | | On the whole, I question the real value of a PDF import filter, but clearly it's desirable since this is not the first thread on the subject. Open source raster to vector tools already exist as do character recognition tools. So it's really a matter of pulling it all together into a reasonable implementation. |
If there were a Python pdf parser that parsed a pdf into a large tree of objects, this would be a huge first step. If you read the pdf document specs, you can see that a pdf is made up of tokens. The tokens form some basic data types, such as long, floating point, string, symbol, and dictionary. A number of more complex types (such as a font) are made up out of dictionaries of dictionaries.
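To make the "tokens into a tree of objects" idea concrete, here is a minimal sketch in Python. It is not an existing parser and covers only a small illustrative subset of the object syntax: integers, reals, names (the spec's term for symbols, written with a leading slash), simple parenthesized strings, arrays, dictionaries, and the keywords true/false/null. Hex strings, nested parentheses, streams and indirect references are left out. The names `tokenize` and `parse_object` are made up for this example.

```python
import re

# Token pattern for an illustrative subset of PDF object syntax.
TOKEN_RE = re.compile(rb"""
    (?P<ws>[\x00\t\n\x0c\r ]+)
  | (?P<comment>%[^\r\n]*)
  | (?P<dict_open><<)
  | (?P<dict_close>>>)
  | (?P<array_open>\[)
  | (?P<array_close>\])
  | (?P<name>/[^\x00\t\n\x0c\r ()<>\[\]{}/%]*)
  | (?P<real>[+-]?(?:\d+\.\d*|\.\d+))
  | (?P<integer>[+-]?\d+)
  | (?P<string>\((?:[^()\\]|\\.)*\))
  | (?P<keyword>[A-Za-z]+)
""", re.VERBOSE)

KEYWORDS = {b"true": True, b"false": False, b"null": None}
_END_ARRAY = object()   # sentinel returned for "]"
_END_DICT = object()    # sentinel returned for ">>"


def tokenize(data):
    """Yield (kind, raw_bytes) pairs, skipping whitespace and comments."""
    pos = 0
    while pos < len(data):
        match = TOKEN_RE.match(data, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind not in ("ws", "comment"):
            yield kind, match.group(kind)


def parse_object(tokens):
    """Consume one object from the token iterator and return it as Python data."""
    kind, text = next(tokens)
    if kind == "integer":
        return int(text)
    if kind == "real":
        return float(text)
    if kind == "name":
        return text.decode("latin-1")   # leading slash kept to mark names
    if kind == "string":
        return text[1:-1]               # raw bytes, escapes not decoded
    if kind == "keyword":
        if text in KEYWORDS:
            return KEYWORDS[text]
        raise ValueError(f"unsupported keyword {text!r}")
    if kind == "array_open":
        items = []
        while True:
            item = parse_object(tokens)
            if item is _END_ARRAY:
                return items
            items.append(item)
    if kind == "dict_open":
        result = {}
        while True:
            key = parse_object(tokens)
            if key is _END_DICT:
                return result
            result[key] = parse_object(tokens)
    if kind == "array_close":
        return _END_ARRAY
    return _END_DICT                    # only "dict_close" remains


# A made-up font-like dictionary containing a nested dictionary.
sample = b"""<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica
   /FirstChar 32 /Widths [ 278 278.5 ] % a comment
   /Encoding << /Type /Encoding >> >>"""
tree = parse_object(tokenize(sample))
```

The result is plain nested dicts and lists, which is the kind of object tree an import filter could then walk to create shapes.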
At this time, I'm still working past the baby steps of building components in python. That is, I'm still focused on many of the "plumbing" issues. Custom interfaces. How to load libraries in Python (Cybb20, if you're reading this, I have a theory on how to do this.). Multiple services in a single component. And many other basic issues to solve. Right now I'm working on the library loader problem because I have enough code that it is not practical to put it all into one huge text file, as OOo seems to want you to do.
As for value, I would love to have a PDF import, even if it were just to Draw. I would love a (general purpose) PDF to Writer import too, but am not convinced that as many PDFs would work (well) with it. Writer does have a "draw page". If I ever got this far, I could place all the shapes into Writer as floating frames, shapes, bitmaps, etc. What you would end up with would indeed be a Writer doc, but the main text body, headers, footers, etc. would be empty! |
|
Graaf General User


Joined: 26 Aug 2004 Posts: 33 Location: Finland
|
Posted: Tue Aug 31, 2004 2:14 pm Post subject: |
|
|
Phew! Quite a thread. I think I can shed some light on this conversation. I'm a graphic designer and I've been using vector-based drawing (Freehand, Illustrator) and publishing (QuarkXPress, InDesign) software for years.
PDF Format
It seems to me that there is a very confused image of pdf's nature in here. DannyB is about the only one who really understands what pdf is all about:
The format is indeed quite “nice and vague”, but essentially vector based. Almost all vector formats support exactly the same set of features (text, bezier curves, bitmaps etc.) as pdf. There is no format that is “pure vector”, i.e. only bezier curves. Even if there were, it wouldn't be very useful in practice.
I think it's best to think of vector formats as containers which can embed (or in some cases, link) many different kinds of data. Exactly as DannyB said.
PDF vs. SVG and other formats
While originally designed as a rather static format simply to provide a standardized means of transferring digital material to printing houses, pdf has soared in popularity because it was (and still is) the only truly interchangeable format available. The need to be compatible with printing houses, which adopted pdf eagerly, made it necessary to add support for it to the vector-based programs. Subsequently, it has become one of the most “standard” formats available in these programs. The fact that Adobe made the format "open" is the key. All other formats are more or less closed and proprietary (many flavors of .ai, .cdr, .fh9, .fh10, .fh11 etc.). .pdf is the .doc of the graphics industry in terms of acceptance.
Pdf is not the best possible format as it still bears the burdens of its original design: there are still things that pdf does not support or does so poorly. It is however about the best we have.
I understand your enthusiasm about SVG and agree that it has much promise. But how many reputable vector/publishing programs currently support it? I mean the software that is used by professionals. One, namely Adobe Illustrator (import only!). Is it going to gain much acceptance in that field? Pdf is so popular (and backed by the gigantic Adobe) that I seriously doubt it. Besides, users of vector programs are so frustrated with all those different formats that they won't want yet another one.
Importing pdf to Draw
What is said above makes this really, really important, so important as to be almost mandatory. PostScript and Encapsulated PostScript (ps and eps) can be converted to pdf quite easily with existing tools (e.g. GSview), and any vector format can be converted to PostScript (by printing to file). This means that if Draw could support pdf, it could indirectly support just about any vector format available!
As DannyB pointed out, every object in a pdf file has a natural counterpart in Draw, which makes Draw the most natural place to import it.
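The PostScript-to-pdf step can also be scripted. The sketch below assumes Ghostscript (the engine that GSview is a front end for) is installed and callable as `gs`; on Windows the executable usually has a different name, so it is passed in as a parameter. The helper names `ps_to_pdf_command` and `ps_to_pdf` and the file names are made up for illustration.

```python
import subprocess


def ps_to_pdf_command(ps_path, pdf_path, gs_executable="gs"):
    """Build a Ghostscript command line that converts PostScript/EPS to PDF."""
    return [
        gs_executable,
        "-dBATCH",                    # exit after the last input file
        "-dNOPAUSE",                  # do not prompt between pages
        "-sDEVICE=pdfwrite",          # Ghostscript's PDF output device
        f"-sOutputFile={pdf_path}",
        ps_path,
    ]


def ps_to_pdf(ps_path, pdf_path, gs_executable="gs"):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ps_to_pdf_command(ps_path, pdf_path, gs_executable), check=True)


# Only builds and shows the command; nothing is executed here.
command = ps_to_pdf_command("drawing.eps", "drawing.pdf")
print(" ".join(command))
```

Calling `ps_to_pdf("drawing.eps", "drawing.pdf")` would perform the actual conversion, provided Ghostscript is on the PATH.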
Importing pdf to Writer
You can't import pdf to Writer and expect pretty, nicely formatted pages even if there were a converter to do the job. Writer's formatting capabilities are not meant to handle floating text boxes, transparency, clipping paths, blends, skewed and rotated objects and such. Even DTP programs, while far more graphically oriented than word processors, aren't capable of doing such things.
I agree that there could be a converter nonetheless – a nice little one which extracts the text but tries to do about nothing more.
From Writer to pdf it's essentially a one-way trip. From Draw it could be two-way.
Last edited by Graaf on Tue Aug 31, 2004 4:34 pm; edited 1 time in total |
|
xirontask Power User

Joined: 25 Aug 2004 Posts: 56
|
Posted: Tue Aug 31, 2004 3:48 pm Post subject: convert PDF to OOo formats? |
|
|
On this site
http://site4.pdf995.com/download.html
I read
"The Pdf995 Suite offers the following features, all at no cost:
• Convert PDF to JPEG, TIFF, BMP, PCX formats
• Convert PDF to text
• Convert PDF to HTML and Word DOC conversion"
I have not downloaded this (gratis) program, but maybe it can help? _________________ Xirontask |
|
Graaf General User


Joined: 26 Aug 2004 Posts: 33 Location: Finland
|
Posted: Tue Aug 31, 2004 4:10 pm Post subject: |
|
|
JPEG, TIFF, BMP and PCX are bitmap formats, which means that you lose not only the editing capabilities but the resolution independence as well. They are useless here.
pdf995's PDF to HTML and Word conversion could be used when building Writer's pdf importer if it were open source. Unfortunately it's not. Besides, it supports only Windows.
I would turn to pstoedit when searching for a solution for PDF -> Draw. It looks very promising. |
|
anvilsoup Super User


Joined: 09 Feb 2003 Posts: 606 Location: Australia, mate!
|
Posted: Wed Dec 22, 2004 3:31 am Post subject: |
|
|
| Actually, if you want to print out a small portion of a PDF, all you need to do is fire up Acrobat Reader 5 or 6 (or 7, which is apparently coming out soon), select the area you want to print using the area select tool (on the toolbar, under the text select icon; hold it down?), then choose File > Print > Selected Area, then Fit to Page, and voila! |
|
shengchieh General User

Joined: 13 Dec 2004 Posts: 11
|
|
Chejkal Newbie

Joined: 24 Feb 2006 Posts: 2 Location: st louis,mo
|
Posted: Mon May 01, 2006 10:19 pm Post subject: What about conversion from printer format? |
|
|
Offhand I forget the extension of the files sent to the printer. However, IF we could convert from that format, there is a freeware program 'pdf creator' (I think there are duplicates of that name, so check to be sure it IS the freeware version) that will take ANYthing going to the printer and convert it to PDF format, and it seems to work fairly well too (so far).
chuck |
|
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3875 Location: UK
|
Posted: Mon May 01, 2006 11:26 pm Post subject: |
|
|
Yes, there is PDF Creator but that is for going to PDF and we can already do that in OOo. The question was about importing PDF, something that nothing can really do as PDF is for document description, not text. _________________ Arch Linux
OOo 3.2.0
OOoSVN, change control for OOo documents:
http://sourceforge.net/projects/ooosvn/ |
|
earonesty Newbie

Joined: 30 Jan 2008 Posts: 1
|
Posted: Wed Jan 30, 2008 9:07 am Post subject: Re: depends... |
|
|
| DannyB wrote: | | Anonymous wrote: | If the pdf file contains text and embedded pictures, the pdf import plugin for koffice under Linux does a good job. For plain text it does anyway;-)
You'd end up with a koffice file that you can export as a RTF file which you could open in OpenOffice. All embedded pictures are saved as png files if I remember right. |
It sounds like KOffice imports a PDF as a "word processing" document not as a "drawing" document. I believe that OOo's Draw and NOT Writer is the correct destination for an imported PDF.
|
That depends on the document. In my daily experience, most PDFs I handle are formatted as "text with embedded images" - and were originally word-processing documents.
Although Draw may be technically correct, there is most probably more demand for, and more usefulness to be derived from, a Writer import. Again, this assumes that most PDF documents people want to import originated in a word processor and are best edited in Writer. Probably the folks at koffice already came to this conclusion - which is why they import as "Writer" now.
In an ideal scenario, both Writer and Draw would be able to "open" PDF files, resulting in a "best attempt" import for each - allowing the user to select the tool deemed appropriate.
IMHO - any attempt (even just importing straight text) would be better than the current behavior of just importing the file as binary. |
|
huwg Super User

Joined: 14 Feb 2007 Posts: 890
|
|
AndrewZ Moderator


Joined: 21 Jun 2004 Posts: 4140 Location: Colorado, USA
|
|
Questular Newbie

Joined: 02 Oct 2008 Posts: 1 Location: Belgium
|
Posted: Thu Oct 02, 2008 2:16 am Post subject: |
|
|
I tried PDF to PDF/A conversion with OOo3 and the Sun pdf-import extension.
Command-line: 6 per minute. Great! |
|