OpenOffice.org Forum at OOoForum.org

Quickie shell script for extracting writer data

 
KirkJobSluder
Power User


Joined: 25 Apr 2003
Posts: 73

Posted: Sun Jun 22, 2003 10:46 am    Post subject: Quickie shell script for extracting writer data

I've been procrastinating and fooling around with glimpse today, so I created a shell script one-liner to extract somewhat readable text from OpenOffice.org Writer files for indexing. (Requires sed and unzip on Unix; these are probably available for Cygwin as well.) Currently, all it does is extract content.xml (which contains the current contents of the file), add a line break after every tag, and write the results to standard output.

Code:

#!/bin/sh
/usr/local/bin/unzip -p "$1" content.xml | sed -e 's/>/>\
/g'


Note that there can't be any extra spaces after the backslash in the second line.

To use it (assuming the script is saved as openOfficeExtractor, made executable, and on your PATH):
Code:

openOfficeExtractor filename.sxw | less


or
Code:

openOfficeExtractor filename.sxw > outputfile.xml
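
For anyone without sed and unzip handy, the same steps (pull content.xml out of the archive, break the line after every tag, write to standard output) can be sketched in Python 3 with only the standard library. This is a rough equivalent and not the script above: the names split_tags and extract_content are made up for illustration, and it assumes content.xml is UTF-8 encoded.

```python
import sys
import zipfile


def split_tags(xml_text):
    """Insert a line break after every '>' so each tag ends its line."""
    return xml_text.replace(">", ">\n")


def extract_content(path):
    """Return content.xml from a Writer file as text, one tag per line."""
    with zipfile.ZipFile(path) as archive:
        raw = archive.read("content.xml")
    # Assumption: content.xml is UTF-8; undecodable bytes are replaced.
    return split_tags(raw.decode("utf-8", "replace"))


if __name__ == "__main__":
    # Process every filename given on the command line.
    for filename in sys.argv[1:]:
        sys.stdout.write(extract_content(filename))
```

Saved under a name of your choosing, it would be run the same way as the shell version, with the output piped to less or redirected to a file.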
KirkJobSluder

Posted: Sun Jun 22, 2003 9:40 pm    Post subject: Needed to do this anyway.

Quick Python script for extracting raw text with paragraph breaks from a Writer file. I have not tested this with anything that includes graphics yet. The big lesson learned from this was to turn off validation and external entity lookups. Requires the PyXML library.

Code:

#!/usr/local/bin/python


#Import all of the important xml handler functions
from xml.sax.handler import ContentHandler
from xml.sax.handler import feature_namespaces
from xml.sax.handler import feature_validation
from xml.sax.handler import feature_external_ges
from xml.sax.handler import feature_external_pes

import xml.sax

#zipfile is used to extract content.xml from the
#OpenOffice.org file.
import zipfile

#grab command line options using sys.argv
import sys

#Use StringIO to create a file-like buffer.
from StringIO import StringIO


class textHandler(ContentHandler):
    """
    This class customizes ContentHandler to grab only
    the content we want.
    """

    def characters(self,ch):
        """Called by the parser to handle anything that is not
        inside a tag.  We just want to print it out."""
        sys.stdout.write(ch.encode("Latin-1",'replace'))

    def endElement(self,tag):
        """At the end of every text paragraph and text heading,
        print a newline."""
        if tag in ('text:p', 'text:h'):
            sys.stdout.write("\n")

#Process the script, get filenames from sys.argv
args = sys.argv[1:]

#process every filename
for filename in args:

    #decompress content.xml from the filename, then close the
    #zipfile handle.
    ziphandle = zipfile.ZipFile(filename)
    content = ziphandle.read("content.xml")
    ziphandle.close()

    #create our parser
    parser = xml.sax.make_parser()

    #use our custom textHandler() to process
    #the files.
    parser.setContentHandler(textHandler())

    #these features turn off validation against the office
    #dtd.  The script will hurl if these are on.
    parser.setFeature(feature_namespaces, 0)
    parser.setFeature(feature_validation, 0)
    parser.setFeature(feature_external_pes, 0)
    parser.setFeature(feature_external_ges, 0)

    #finally, parse the content.
    parser.parse(StringIO(content))
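
The script above is written for Python 2 (it imports the old StringIO module). For reference, here is a rough sketch of the same approach in Python 3 using only the standard library. It is an adaptation under assumptions, not the original: it collects the text into a string instead of printing as it goes, and the names TextHandler, content_xml_to_text and writer_file_to_text are made up for illustration. The four setFeature calls carry over the lesson about validation and external entity lookups.

```python
import io
import sys
import zipfile
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
    feature_namespaces,
    feature_validation,
)


class TextHandler(ContentHandler):
    """Collect character data; add a newline after each paragraph or heading."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)

    def endElement(self, name):
        # With namespace processing off, names arrive as raw "prefix:tag".
        if name in ("text:p", "text:h"):
            self.parts.append("\n")


def content_xml_to_text(xml_bytes):
    """Parse the bytes of a content.xml and return its plain text."""
    handler = TextHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Same switches as the original script: no namespace processing,
    # no validation, no external entity lookups.
    parser.setFeature(feature_namespaces, False)
    parser.setFeature(feature_validation, False)
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(handler.parts)


def writer_file_to_text(path):
    """Read content.xml out of a Writer file and return its plain text."""
    with zipfile.ZipFile(path) as archive:
        return content_xml_to_text(archive.read("content.xml"))


if __name__ == "__main__":
    # Process every filename given on the command line.
    for filename in sys.argv[1:]:
        sys.stdout.write(writer_file_to_text(filename))
```

Collecting the pieces in a list and joining them at the end avoids having to pick an output encoding inside the handler, which the original does with Latin-1.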
All times are GMT - 8 Hours