KirkJobSluder Power User

Joined: 25 Apr 2003 Posts: 73
|
Posted: Sun Jun 22, 2003 10:46 am Post subject: Quickie shell script for extracting writer data |
|
|
I've been procrastinating and fooling around with glimpse today, so I created a shell script one-liner to extract somewhat readable text from OpenOffice Writer files for indexing. (It requires sed and unzip on Unix; these are probably available for Cygwin as well.) Currently, all it does is extract content.xml (which contains the current contents of the file), add a line break after every tag, and write the result to standard output.
| Code: |
#!/bin/sh
/usr/local/bin/unzip -p "$1" content.xml | sed -e 's/>/>\
/g'
|
Note that there can't be any extra spaces after the backslash at the end of the second line; the backslash has to be followed immediately by the newline.
To use it:
| Code: |
openOfficeExtractor filename.sxw | less
|
or
| Code: |
openOfficeExtractor filename.sxw > outputfile.xml
|
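For anyone without sed or unzip handy, the same two steps (pull out content.xml, then break the line after every tag) can be sketched in Python with the standard zipfile module. This is only a rough equivalent, written for Python 3. The function names `split_tags` and `extract_content` are made up for illustration, and the sketch assumes content.xml is UTF-8 encoded.

```python
import zipfile


def split_tags(xml_text):
    """Insert a newline after every '>' so each tag ends its own line."""
    return xml_text.replace(">", ">\n")


def extract_content(path):
    """Return content.xml from a Writer file, with a line break after every tag."""
    with zipfile.ZipFile(path) as archive:
        raw = archive.read("content.xml")
    # Assumption: content.xml is UTF-8 encoded.
    return split_tags(raw.decode("utf-8"))
```

Calling `print(extract_content("filename.sxw"))` should give roughly the same output as the shell one-liner.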
|
|
KirkJobSluder Power User

Joined: 25 Apr 2003 Posts: 73
|
Posted: Sun Jun 22, 2003 9:40 pm Post subject: Needed to do this anyway. |
|
|
Here is a quick Python script for extracting raw text with paragraph breaks from a Writer file. I have not tested this with anything that includes graphics yet. The big lesson learned from this was to turn off validation and external entity lookups; with them on, the parser tries to check the document against the office DTD and the script fails. Requires the PyXML library.
| Code: |
#!/usr/local/bin/python
# Written for Python 2 (uses the StringIO module).
# Import the XML handler class and the parser feature flags.
from xml.sax.handler import ContentHandler
from xml.sax.handler import feature_namespaces
from xml.sax.handler import feature_validation
from xml.sax.handler import feature_external_ges
from xml.sax.handler import feature_external_pes
import xml.sax
# zipfile is used to extract content.xml from the
# OpenOffice file.
import zipfile
# Grab command line options using sys.argv.
import sys
# Use StringIO to create a file-like buffer.
from StringIO import StringIO

class textHandler(ContentHandler):
    """
    This class customizes ContentHandler to grab only
    the content we want.
    """
    def characters(self, ch):
        """Called by the parser to handle anything that is not
        inside a tag. We just want to print it out."""
        sys.stdout.write(ch.encode("Latin-1", 'replace'))

    def endElement(self, tag):
        """At the end of every text paragraph and text header,
        print a newline."""
        if tag == 'text:p' or tag == 'text:h':
            sys.stdout.write("\n")

# Process the script, get filenames from sys.argv.
args = sys.argv[1:]
# Process every filename.
for filename in args:
    # Decompress content.xml from the file, then close the
    # zipfile handle.
    ziphandle = zipfile.ZipFile(filename)
    content = ziphandle.read("content.xml")
    ziphandle.close()
    # Create our parser.
    parser = xml.sax.make_parser()
    # Use our custom textHandler() to process
    # the files.
    parser.setContentHandler(textHandler())
    # These features turn off validation against the office
    # DTD. The script will hurl if these are on.
    parser.setFeature(feature_namespaces, 0)
    parser.setFeature(feature_validation, 0)
    parser.setFeature(feature_external_pes, 0)
    parser.setFeature(feature_external_ges, 0)
    # Finally, parse the content.
    parser.parse(StringIO(content))
|
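The same approach can be sketched for Python 3 using only the standard library's xml.sax, which ships an expat-based parser, so this version should not need PyXML. Treat it as an untested-on-real-documents sketch. The names `TextCollector`, `extract_text`, and `extract_text_from_file` are made up for illustration, and the text is collected into a string instead of being written straight to standard output. The four `setFeature` calls are the part that matters: they keep validation, namespace processing, and external entity lookups switched off.

```python
import io
import xml.sax
import zipfile
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
    feature_namespaces,
    feature_validation,
)


class TextCollector(ContentHandler):
    """Collect character data, adding a newline after each paragraph or heading."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        # Called for text between tags; keep it as-is.
        self.parts.append(content)

    def endElement(self, name):
        # With namespace processing off, names arrive as raw qualified names.
        if name in ("text:p", "text:h"):
            self.parts.append("\n")


def extract_text(content_xml):
    """Return the plain text found in a content.xml byte string."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Turn off namespace processing, validation, and external entity lookups.
    parser.setFeature(feature_namespaces, False)
    parser.setFeature(feature_validation, False)
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    parser.parse(io.BytesIO(content_xml))
    return "".join(handler.parts)


def extract_text_from_file(path):
    """Read content.xml out of a Writer file and return its plain text."""
    with zipfile.ZipFile(path) as archive:
        return extract_text(archive.read("content.xml"))
```

Something like `print(extract_text_from_file("filename.sxw"))` would then print the text. Returning a string leaves the choice of output encoding to the caller, where the original script encodes to Latin-1 as it goes.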