OpenOffice.org Forum at OOoForum.org

Linux shell script to reduce OpenDocument file sizes

9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Mon Nov 21, 2005 1:17 pm    Post subject: Linux shell script to reduce OpenDocument file sizes

In this previous thread I discussed some methods of reducing the size of OpenDocument files losslessly:
http://www.oooforum.org/forum/viewtopic.phtml?t=27339

This is a Linux shell script I have written to automate this:
Code:
#!/bin/bash
#######################################################################
# Author: Edward Holness                                              #
# Date: 21/11/2005                                                    #
# ScriptName: optimOD (Optimise OpenDocument)                         #
# Description: A script that optimises OpenDocument (and *.sx* files, #
#      with modification) losslessly.  It unzips the files,           #
#      optimises PNG and JPEG images, then recompresses using         #
#      higher compression yet still being compatible with             #
#      OpenDocument.                                                  #
# Requirements: zip, unzip, pngcrush and jpegoptim in your path       #
# Command line instructions: sh optimod <filename>                    #
# Known bugs:                                                         #
#######################################################################

if test -f "$1"       # test that the file exists
   then echo "$1 found"      # confirm that the file exists
else
   echo "no file to optimise, exiting nicely."   # warn that the file doesn't exist and exit
   exit 1
fi

unzip -oq "$1" -d "optimod_$1"   # Extract the file
echo "$1 extracted to optimod_$1"

#### Pictures

mkdir "optimod_$1/Pictures/temp"   # Make the pictures temp directory

pngcrush -q -d "optimod_$1/Pictures/temp" -brute "optimod_$1"/Pictures/*.png   # Crush the PNGs
echo PNG graphics optimised

mv "optimod_$1"/Pictures/temp/*.png "optimod_$1/Pictures/"   # Move the crushed PNGs on top of the originals

jpegoptim --quiet "optimod_$1"/Pictures/*.jpg   # Optimise JPEGs losslessly
echo JPEG graphics optimised

rmdir "optimod_$1/Pictures/temp/"   # Remove temp directory

#### Thumbnails, comment out for *.sx*

mkdir "optimod_$1/Thumbnails/temp"   # Make the thumbnails temp directory

pngcrush -q -d "optimod_$1/Thumbnails/temp" -brute "optimod_$1"/Thumbnails/*.png   # Crush the PNGs
echo PNG Thumbnails optimised

mv "optimod_$1"/Thumbnails/temp/*.png "optimod_$1/Thumbnails/"   # Move the crushed PNGs on top of the originals

rmdir "optimod_$1/Thumbnails/temp/"

#### Compress
cd "optimod_$1/" || exit 1   # Stop here rather than run the clean-up in the wrong directory
zip -rq9 "../$1" *

#### Clear up the junk
rm -rf *
cd ../
rmdir "optimod_$1/"


I've got about 22% file size reduction from it. If you save the file again you will lose the ZIP compression but maintain the image compression unless you modify the images too. It should work on all OpenDocument files and will maintain the file structure. It can also work on *.sx* files if you comment out the bit about Thumbnails.

It requires zip, unzip, pngcrush and jpegoptim. These are freely available and may already be on your system; most distros offer them as optional packages.

On my Athlon XP 2400+ it takes about 2 minutes to run on a big file. It can be greatly sped up by dropping pngcrush's -brute option without much loss of compression, but this is about minimum file size, not speed.

This is not intended as a tool to use on every file. It would be useful if you want to email a file or post it on the web and want to save on bandwidth. It is entirely lossless, but please do make backups in case something goes wrong (I've not broken anything yet).

My shell scripting isn't up to much so it won't batch process. If anyone has any experience in this I'd be happy to find out. Any discussion of improving methods would be better in this thread:
http://www.oooforum.org/forum/viewtopic.phtml?t=27339
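
On the batch processing question, a small wrapper loop may be enough. This is only a sketch under assumptions that are not in the post above: it assumes the script has been saved as an executable called optimod somewhere on your path (the OPTIMOD variable is a made-up override), and the function name optimod_batch and the .bak suffix are invented for illustration. Because the script builds its paths from the bare file name, run it from the directory that holds the documents.

```shell
#!/bin/bash
# Hypothetical batch wrapper: runs the optimiser once per file given.
# Assumes the script above is installed as "optimod" (override with OPTIMOD=...).
optimod_batch() {
    local f status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "skipping $f: not a regular file" >&2
            status=1
            continue
        fi
        # Keep a backup before touching the document
        cp -p -- "$f" "$f.bak" || { status=1; continue; }
        if ! "${OPTIMOD:-optimod}" "$f"; then
            echo "optimiser failed on $f, restoring backup" >&2
            mv -f -- "$f.bak" "$f"
            status=1
        fi
    done
    return "$status"
}

# Usage (from the directory containing the documents):
#   optimod_batch *.odt *.ods
```

The backups are left in place after a successful run, so you can compare sizes or roll back by hand.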
_________________
Arch Linux
OOo 3.2.0

OOoSVN, change control for OOo documents:
http://sourceforge.net/projects/ooosvn/

AndrewZ
Moderator

Joined: 21 Jun 2004
Posts: 4140
Location: Colorado, USA

Posted: Mon Nov 21, 2005 1:38 pm

Though images usually account for most of the space, you could save some more by compacting the XML. For example, you could remove all redundant namespace declarations and unnecessary whitespace.

9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Mon Nov 21, 2005 1:54 pm

ahz wrote:
you could reduce some space by compacting the XML

This is something for which I would have to learn the OpenDocument specification before attempting. The last thing I'd want to do is break ODF. It is not just a question of whether a file works in OOo; it would have to work in any OpenDocument program. I will give this a go in future.

I also think that the Configurations2 and Thumbnails directories could be removed, but that might break ODF.
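
On the worry about breaking ODF: one packaging detail that is easy to lose when re-zipping is the mimetype entry. As far as I know, the OpenDocument package rules ask for it to be the first entry in the ZIP and to be stored uncompressed, and a plain zip -r over * does not guarantee either. A hedged sketch of a repack step that takes care of this, using Info-ZIP's -0 (store), -X (no extra fields) and -x (exclude) options; the function name odf_repack is made up for illustration and is not part of the script in this thread.

```shell
# Hypothetical helper: repack an extracted OpenDocument directory so that
# "mimetype" is the first entry and is stored without compression.
#   $1 = extracted directory
#   $2 = output file (absolute path, or relative to $1)
odf_repack() {
    (
        cd "$1" || exit 1
        rm -f -- "$2"
        # First entry: mimetype, stored (-0) and without extra fields (-X)
        zip -q0X "$2" mimetype &&
        # Everything else at maximum compression, appended after mimetype
        zip -qr9X "$2" * -x mimetype
    )
}

# Usage:
#   odf_repack "optimod_$1" "../$1.new"
```

The subshell keeps the cd from leaking into the rest of the script, so nothing later runs in the wrong directory.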

AndrewZ
Moderator

Joined: 21 Jun 2004
Posts: 4140
Location: Colorado, USA

Posted: Mon Nov 21, 2005 2:16 pm

9point9 wrote:
ahz wrote:
you could reduce some space by compacting the XML

This is something for which I would have to learn the OpenDocument specification before attempting.


You don't need to: it's just a superficial XML operation.

Quote:
The last thing I'd want to do is break ODF. It is not just for whether a file works in OOo, it would have to work in any OpenDocument program.


The meaning of the XML will not change any more than a PNG or JPEG after their respective optimizations. Because all these changes are superficial, other programs should be able to read the documents too.

Quote:
I will give this a go in future.

There are probably many ways to do it. I have a program called xml (XMLStarlet Toolkit), and this command should reduce the file size:

Code:
xml fo --noindent --nsclean filename.xml


The above operation probably doesn't remove the end of line markers (which are not required).

9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Mon Nov 21, 2005 3:02 pm

I thought you meant editing the files, that would be too complicated!

I've had a go with an XML tool called xmllint. It also uses --nsclean but doesn't use --noindent; I think that is automatic. It is only useful on content.xml; the other files would be the same size if it wasn't for the extra carriage return that it puts at the end of the file.

Using it just on content.xml (and not getting rid of the extra carriage return) and sticking it in my previous script brings the file size down to 677587 from 677642 before.

AndrewZ
Moderator

Joined: 21 Jun 2004
Posts: 4140
Location: Colorado, USA

Posted: Mon Nov 21, 2005 3:16 pm

9point9 wrote:
Using it just on content.xml (and not getting rid of the extra carriage return) and sticking it in my previous script brings the file size down to 677587 from 677642 before.


Looking at content.xml, it appears to be pretty well compressed already: there is only one carriage return in the whole file, and I don't see any extra whitespace. So files produced by OpenOffice.org will not benefit from this procedure.

9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Tue Nov 22, 2005 8:15 am

I've made quite a few changes to the script:
Code:
#!/bin/bash
#######################################################################
# Author: Edward Holness                                              #
# Date: 22/11/2005                                                    #
# ScriptName: optimOD (Optimise OpenDocument)                         #
# Description: A script that optimises OpenDocument (and *.sx* files, #
#      with modification) losslessly.  It unzips the files,           #
#      optimises PNG and JPEG images, then recompresses using         #
#      higher compression yet still being compatible with             #
#      OpenDocument.                                                  #
# Requirements: zip, unzip, pngcrush, jpegoptim and xmllint in your   #
#      path.                                                          #
# Command line instructions: sh optimod <filename>                    #
# Known bugs:                                                         #
#######################################################################

if test -f "$1"       # test that the file exists
   then echo "$1 found"      # confirm that the file exists
else
   echo "no file to optimise, exiting nicely."   # warn that the file doesn't exist and exit
   exit 1
fi

unzip -oq "$1" -d "optimod_$1"   # Extract the file
echo "$1 extracted to optimod_$1"

#### Pictures

mkdir "optimod_$1/Pictures/temp"   # Make the pictures temp directory

pngcrush -q -d "optimod_$1/Pictures/temp" -brute "optimod_$1"/Pictures/*.png   # Crush the PNGs
echo PNG graphics optimised

mv "optimod_$1"/Pictures/temp/*.png "optimod_$1/Pictures/"   # Move the crushed PNGs on top of the originals

jpegoptim --quiet "optimod_$1"/Pictures/*.jpg   # Optimise JPEGs losslessly
echo JPEG graphics optimised

rmdir "optimod_$1/Pictures/temp/"   # Remove temp directory

#### Thumbnails, comment out for *.sx*

mkdir "optimod_$1/Thumbnails/temp"   # Make the thumbnails temp directory

pngcrush -q -d "optimod_$1/Thumbnails/temp" -brute "optimod_$1"/Thumbnails/*.png   # Crush the PNGs
echo PNG Thumbnails optimised

mv "optimod_$1"/Thumbnails/temp/*.png "optimod_$1/Thumbnails/"   # Move the crushed PNGs on top of the originals

rmdir "optimod_$1/Thumbnails/temp/"


#### reduce contents, comment out for original content
rmdir "optimod_$1/Configurations2/"
rm "optimod_$1/layout-cache"
rm -rf "optimod_$1/Thumbnails/"

#### content.xml
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
sed '/^ *$/d' "optimod_$1/content.xml.lint" > "optimod_$1/content.xml"   # Remove blank lines
rm "optimod_$1/content.xml.lint"


#### Compress
cd "optimod_$1/" || exit 1   # Stop here rather than run the clean-up in the wrong directory
zip -rq9 "../$1.new" *
mv "../$1.new" "../$1"

#### Clear up the junk
rm -rf *
cd ../
rmdir "optimod_$1/"


1. It now uses xmllint to compact content.xml, then sed to remove the line break that xmllint inserts. Thanks to ahz for his suggestion. The other xml files don't seem to need compaction but if I could do intelligent comparison of file sizes then I would make them compactable.

2. It removes Configurations2/. This doesn't seem necessary as it's empty and is recreated on modifying the file.

3. It removes layout-cache. This is recreated on modification. It just slows down the file open process a bit.

4. Removes Thumbnails/. This is a tricky one as it's specified in OpenDocument but works OK. I always use a detailed listing in any file manager so can't see the use of thumbnails myself. It will be recreated on editing. I've left in the code for optimising Thumbnails/ just in case you don't want it removed.
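
The "intelligent comparison of file sizes" from point 1 could be sketched roughly as below. This is an illustration only, not something the script above does: the helper name keep_if_smaller is made up, and it simply keeps the compacted copy when it is non-empty and smaller than the original, and throws it away otherwise.

```shell
# Hypothetical helper: replace $1 with $2 only when $2 is non-empty and
# smaller than $1; otherwise discard $2 and leave $1 untouched.
keep_if_smaller() {
    local old new
    old=$(wc -c < "$1") || return 1
    new=$(wc -c < "$2") || return 1
    if [ "$new" -gt 0 ] && [ "$new" -lt "$old" ]; then
        mv -f -- "$2" "$1"
    else
        rm -f -- "$2"
    fi
}

# Possible use inside the script, for every XML file in the package:
#   for x in "optimod_$1"/*.xml; do
#       if xmllint --nsclean "$x" > "$x.lint"; then
#           keep_if_smaller "$x" "$x.lint"
#       else
#           rm -f "$x.lint"
#       fi
#   done
```

With this in place the other XML files can be run through xmllint too, since a file that comes out larger is simply left as it was.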

This script now reduces the OOo setup guide 1.23 from 868405 to 675327. That's 22.23%.

AndrewZ
Moderator

Joined: 21 Jun 2004
Posts: 4140
Location: Colorado, USA

Posted: Tue Nov 22, 2005 10:09 am

I do like your script: I plan to try the JPEG optimizations (nice idea) against one of my web sites. But I have two more thoughts about your script:

First, there is an option in OpenOffice.org for optimizing XML, and it seems to be on by default.

Second, the script may cause data loss in some situations, and this problem could be avoided by checking that the tools exist and that no error has occurred. For example, in the following code:

Code:
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
sed '/^ *$/d' "optimod_$1/content.xml.lint" > "optimod_$1/content.xml"


Several things could happen:
1. xmllint does not exist
2. sed does not exist
3. xmllint has a runtime error
4. sed has a runtime error

The runtime errors are probably rare, but perhaps someone would run the script without some of the required tools.
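
A minimal sketch of both checks, assuming a POSIX-style shell with command -v available. The helper name require_tools is made up for illustration, and the commented lines show one possible way to chain the two commands so that content.xml is only overwritten when both steps succeed.

```shell
# Hypothetical helper: report every missing tool and fail if any is absent.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "required tool not found: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use near the top of the script:
#   require_tools zip unzip pngcrush jpegoptim xmllint sed || exit 1
#
# Possible replacement for the two lines quoted above; the original
# content.xml is only replaced if xmllint and sed both exit successfully:
#   xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint" &&
#       sed '/^ *$/d' "optimod_$1/content.xml.lint" > "optimod_$1/content.xml.new" &&
#       mv "optimod_$1/content.xml.new" "optimod_$1/content.xml"
```

Writing to a .new file first matters because the shell truncates the redirect target before the command runs, so a failing sed would otherwise leave an empty content.xml behind.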

9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Tue Nov 22, 2005 1:49 pm

ahz wrote:
I plan to try the JPEG optimizations (nice idea) against one of my web sites.

Think of all the bandwidth you could save yourself! It's amazing how many poorly compressed images there are out there. jpegoptim does seem a bit hit and miss. It will sometimes knock 15% off a JPEG but sometimes it gives no improvement. pngcrush gives a more dependable output. I have not yet found a file that it could not compact, and sometimes it saves over 40%. This is mostly because of the difference in compression methods between PNG and JPEG. It also depends on the program used to produce the image. Some are better than others.

ahz wrote:
First, there is an option in OpenOffice.org for optimizing XML, and it seems to be on by default.

My experiments seem to show that it's not all that well optimised. It doesn't say what it's optimised for. Whatever the case, we cannot expect it to compact better than a dedicated program.


ahz wrote:
Second, the script may cause data loss in some situations, and this problem could be avoided by checking that the tools exist and that no error has occurred.

The script is really stupid. I am assuming that people will check that they have these things and make a backup copy before running it. Making it more intelligent is something I will do. This is my first real shell script in 7 months; my last, well, that was my first. I'm learning as I go along.

If the outputted file is openable in OOo and looks like the original then it will almost certainly have worked.
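
A quick mechanical check can be run before opening the file in OOo. This is a rough sketch only and much weaker than actually looking at the document: it assumes unzip and xmllint are installed, and the function name check_odf is made up for illustration. It tests that the archive is intact and that content.xml is still well-formed XML.

```shell
# Hypothetical check: succeed only if the archive passes unzip's integrity
# test and its content.xml still parses as well-formed XML.
check_odf() {
    unzip -tq "$1" >/dev/null || return 1
    # -p writes the entry to stdout; "-" makes xmllint read from stdin
    unzip -p "$1" content.xml | xmllint --noout -
}

# Usage:
#   check_odf mydocument.odt && echo "archive and content.xml look sane"
```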

9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Thu Dec 01, 2005 1:10 pm

I've made a whole load of minor changes which may reduce file sizes more:

- Used the '-reduce' switch with pngcrush. This losslessly reduces the bit depth of an image whose content could be expressed using a smaller palette. This will rarely be applied.

- Used the '-rem alla' pngcrush switch. This will remove all metadata from the PNG images. It is the metadata of the document that is important, not that of the image; most people don't set image metadata and there is no real use for it in a document, so I think it can be safely taken out to save space.

- Used the '--strip-all' jpegoptim switch. This also removes metadata tags to save space; again, I don't think they're that useful.

- Removed the bit about optimising thumbnails as they get deleted later on anyway. If you want them back, my older code does it.

- Got rid of the sed bit. It didn't do anything as xmllint did the same and there's still an empty line at the end.

Running this on the original test file brings it down to 674540 bytes. That's 22.32% compaction. This is only on a particular file; on one of my own I have managed to save 41.45%! I'd be keen to know if anyone can beat this!
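
For anyone who wants to compare results, the percentages quoted in this thread are simply (before - after) * 100 / before. A small sketch using awk for the arithmetic; the function name percent_saved is made up for illustration.

```shell
# Hypothetical helper: percentage saved going from $1 bytes to $2 bytes.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f\n", (before - after) * 100 / before }'
}

# The test file from this thread: 868405 bytes before, 674540 bytes after
percent_saved 868405 674540   # prints 22.32

# With real files, the sizes can be taken from wc -c, for example:
#   percent_saved "$(wc -c < original.odt)" "$(wc -c < optimised.odt)"
```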

Code:
#!/bin/bash
#######################################################################
# Author: Edward Holness                                              #
# Date: 01/12/2005                                                    #
# ScriptName: optimOD (Optimise OpenDocument)                         #
# Description: A script that optimises OpenDocument (and *.sx* files, #
#      with modification) losslessly.  It unzips the files,           #
#      optimises PNG and JPEG images, removes unnecessary             #
#      contents, then recompresses using higher compression           #
#      yet still being compatible with OpenDocument.                  #
# Requirements: zip, unzip, pngcrush, jpegoptim and xmllint in your   #
#      path.                                                          #
# Command line instructions: sh optimod <filename> or add to your path#
# Known bugs:                                                         #
#######################################################################

if test -f "$1"       # test that the file exists
   then echo "$1 found"      # confirm that the file exists
else
   echo "no file to optimise, exiting nicely."   # warn that the file doesn't exist and exit
   exit 1
fi

unzip -oq "$1" -d "optimod_$1"   # Extract the file
echo "$1 extracted to optimod_$1"

#### Pictures
mkdir "optimod_$1/Pictures/temp"   # Make the pictures temp directory

pngcrush -q -d "optimod_$1/Pictures/temp" -brute -reduce -rem alla "optimod_$1"/Pictures/*.png   # Crush the PNGs
echo PNG graphics optimised

mv "optimod_$1"/Pictures/temp/*.png "optimod_$1/Pictures/"   # Move the crushed PNGs on top of the originals

jpegoptim --strip-all --quiet "optimod_$1"/Pictures/*.jpg   # Optimise JPEGs losslessly
echo JPEG graphics optimised

rmdir "optimod_$1/Pictures/temp/"   # Remove temp directory

#### reduce contents, comment out for original content
rmdir "optimod_$1/Configurations2/"
echo Removed Configurations2/
rm "optimod_$1/layout-cache"
echo Removed layout-cache
rm -rf "optimod_$1/Thumbnails/"
echo Removed Thumbnails/

#### content.xml
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
mv "optimod_$1/content.xml.lint" "optimod_$1/content.xml"
echo Compacted content.xml

#### Compress
cd "optimod_$1/" || exit 1   # Stop here rather than run the clean-up in the wrong directory
zip -rq9 "../$1.new" *
mv "../$1.new" "../$1"
echo Compressed

#### Clear up the junk
rm -rf *
cd ../
rmdir "optimod_$1/"


9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Sun Mar 12, 2006 6:44 am

I've changed the script to use OptiPNG instead of pngcrush. It currently uses 1080 runs. This is more than is recommended but gives a result 3 KB smaller than with only 240 runs. It now takes a lot longer but does get the files smaller.

I hope to incorporate brute-forcing of the zlib window size. That should make it a factor of 14 slower, but there has got to be a reason to have an Athlon 64 FX-60.

OptiPNG is better than pngcrush, so you could always reduce the number of runs and end up with something faster.

This was done with OptiPNG 0.5, which doesn't seem to include everything that pngcrush does yet. I would expect it to get better with later versions.

This can reduce the setup guide draft 1 used above to 669518. That's 22.9%.

Code:
#!/bin/bash
#######################################################################
# Author: Edward Holness                                              #
# Date: 2006-03-12                                                    #
# ScriptName: optimOD (Optimise OpenDocument)                         #
# Description: A script that optimises OpenDocument (and *.sx* files, #
#      with modification) losslessly.  It unzips the files,           #
#      optimises PNG and JPEG images, removes unnecessary             #
#      contents, then recompresses using higher compression           #
#      yet still being compatible with OpenDocument.                  #
# Requirements: zip, unzip, optipng, jpegoptim and xmllint in your    #
#      path.                                                          #
# Command line instructions: sh optimod <filename> or add to your path#
# Known bugs:                                                         #
#######################################################################

if test -f "$1"       # test that the file exists
   then echo "$1 found"      # confirm that the file exists
else
   echo "no file to optimise, exiting nicely."   # warn that the file doesn't exist and exit
   exit 1
fi

unzip -oq "$1" -d "optimod_$1"   # Extract the file
echo "$1 extracted to optimod_$1"

#### Pictures
optipng -zc1-9 -zm1-9 -zs0-3 -f0-5 "optimod_$1"/Pictures/*.png   # Try every zlib/filter combination

echo PNG graphics optimised

jpegoptim --strip-all --quiet "optimod_$1"/Pictures/*.jpg   # Optimise JPEGs losslessly
echo JPEG graphics optimised

#### reduce contents, comment out for original content
rmdir "optimod_$1/Configurations2/"
echo Removed Configurations2/
rm "optimod_$1/layout-cache"
echo Removed layout-cache
rm -rf "optimod_$1/Thumbnails/"
echo Removed Thumbnails/

#### content.xml
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
mv "optimod_$1/content.xml.lint" "optimod_$1/content.xml"
echo Compacted content.xml

#### Compress
cd "optimod_$1/" || exit 1   # Stop here rather than run the clean-up in the wrong directory
zip -rq9 "../$1.new" *
mv "../$1.new" "../$1"
echo Compressed

#### Clear up the junk
rm -rf *
cd ../
rmdir "optimod_$1/"


9point9
Moderator

Joined: 31 Aug 2004
Posts: 3875
Location: UK

Posted: Sun Mar 12, 2006 1:43 pm

Adding AdvanceCOMP gives some improvement:
Code:
advpng -z4 "optimod_$1"/Pictures/*.png
advzip "../$1" -z4


The setup guide file now comes to 661082 bytes.

Low-level modification could also be used to repair corrupt OpenDocument files. I'm thinking about this now and how to implement it.

All times are GMT - 8 Hours