9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Mon Nov 21, 2005 1:17 pm Post subject: Linux shell script to reduce OpenDocument file sizes |
|
|
In this previous thread I discussed some methods of reducing the size of OpenDocument files losslessly:
http://www.oooforum.org/forum/viewtopic.phtml?t=27339
This is a Linux shell script I have written to automate this:
| Code: | #!/bin/bash
#######################################################################
# Author: Edward Holness #
# Date: 21/11/2005 #
# ScriptName: optimOD (Optimise OpenDocument) #
# Description: A script that optimises OpenDocument (and *.sx* files,#
# with modification) losslessly. It unzips the files, #
# optimises PNG and JPEG images, then recompresses using#
# higher compression yet still being compatible with #
# OpenDocument. #
# Requirements: zip, unzip, pngcrush and jpegoptim in your path #
# Command line instructions: sh optimod <filename> #
# Known bugs: #
#######################################################################
if test -f "$1" # Test that the file exists
then echo "$1 found" # Confirm that the file exists
else
echo "no file to optimise, exiting nicely." >&2 # Warn that the file doesn't exist and exit
exit 1
fi
unzip -oq "$1" -d "optimod_$1" || exit 1 # Extract the file
echo "$1 extracted to optimod_$1"
#### Pictures
mkdir "optimod_$1/Pictures/temp" # Make the pictures temp directory
pngcrush -q -d "optimod_$1/Pictures/temp" -brute "optimod_$1"/Pictures/*.png # Crush the PNGs
echo PNG graphics optimised
mv "optimod_$1"/Pictures/temp/*.png "optimod_$1/Pictures/" # Move the crushed PNGs on top of the originals
jpegoptim --quiet "optimod_$1"/Pictures/*.jpg # Optimise JPEGs losslessly
echo JPEG graphics optimised
rmdir "optimod_$1/Pictures/temp/" # Remove temp directory
#### Thumbnails, comment out for *.sx*
mkdir "optimod_$1/Thumbnails/temp" # Make the thumbnails temp directory
pngcrush -q -d "optimod_$1/Thumbnails/temp" -brute "optimod_$1"/Thumbnails/*.png # Crush the PNGs
echo PNG Thumbnails optimised
mv "optimod_$1"/Thumbnails/temp/*.png "optimod_$1/Thumbnails/" # Move the crushed PNGs on top of the originals
rmdir "optimod_$1/Thumbnails/temp/"
#### Compress
cd "optimod_$1/" || exit 1 # Stop rather than run zip from the wrong directory
zip -rq9 "../$1" *
#### Clear up the junk
cd ../
rm -rf "optimod_$1/" |
I've got about 22% file size reduction from it. If you save the file again you will lose the ZIP compression but maintain the image compression unless you modify the images too. It should work on all OpenDocument files and will maintain the file structure. It can also work on *.sx* files if you comment out the bit about Thumbnails.
It requires zip, unzip, pngcrush and jpegoptim. These are freely available and may already be on your system; most distros offer them as optional packages.
On my Athlon XP 2400+ it takes about 2 minutes to run on a big file. It can be greatly sped up by removing the -brute option from pngcrush without much loss of compression, but this script is about minimum file size, not speed.
This is not intended as a tool to use on every file. It would be useful if you want to email a file or post it on the web and want to save on bandwidth. It is entirely lossless, but do please make backups in case something goes wrong (I've not broken anything yet).
My shell scripting isn't up to much so it won't batch process. If anyone has any experience in this I'd be happy to find out. Any discussion of improving methods would be better in this thread:
http://www.oooforum.org/forum/viewtopic.phtml?t=27339 _________________ Arch Linux
OOo 3.2.0
OOoSVN, change control for OOo documents:
http://sourceforge.net/projects/ooosvn/ |
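On the batch-processing question above, one possible approach is to wrap the body of the script in a function and loop over the command-line arguments. The sketch below is only an illustration under that assumption: `optimise_one` and `batch_optimise` are hypothetical names, and `optimise_one` is a placeholder that just prints the file name where the real unzip, optimise and zip steps would go.

```shell
# Hypothetical stand-in for the body of optimOD; here it only reports the name.
optimise_one() {
    printf 'optimising %s\n' "$1"
}

# Run optimise_one on every argument that is a regular file.
# Returns non-zero if any argument was skipped or any run failed.
batch_optimise() {
    bo_status=0
    for bo_file in "$@"; do
        if [ -f "$bo_file" ]; then
            optimise_one "$bo_file" || bo_status=1
        else
            printf 'skipping %s: not a regular file\n' "$bo_file" >&2
            bo_status=1
        fi
    done
    return "$bo_status"
}

# As the last line of the script, so that "sh optimod *.odt" handles every match:
# batch_optimise "$@"
```

Quoting "$bo_file" everywhere keeps file names with spaces in one piece, which the original script does not do.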
 |
AndrewZ Moderator


Joined: 21 Jun 2004 Posts: 4148 Location: Colorado, USA
|
Posted: Mon Nov 21, 2005 1:38 pm Post subject: |
|
|
| Though images usually account for most space, you could reduce some space by compacting the XML. For example, you could remove all redundant namespace declarations and unnecessary whitespace. |
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Mon Nov 21, 2005 1:54 pm Post subject: |
|
|
| ahz wrote: | | you could reduce some space by compacting the XML |
This is something for which I would have to learn the OpenDocument specification before attempting. The last thing I'd want to do is break ODF. It is not just for whether a file works in OOo, it would have to work in any OpenDocument program. I will give this a go in future.
I also think that the Configurations2 and Thumbnails directories could be removed but that might break ODF. |
 |
AndrewZ Moderator


Joined: 21 Jun 2004 Posts: 4148 Location: Colorado, USA
|
Posted: Mon Nov 21, 2005 2:16 pm Post subject: |
|
|
| 9point9 wrote: | | ahz wrote: | | you could reduce some space by compacting the XML |
This is something for which I would have to learn the OpenDocument specification before attempting. |
You don't need to: it's just a superficial XML operation.
| Quote: | | The last thing I'd want to do is break ODF. It is not just for whether a file works in OOo, it would have to work in any OpenDocument program. |
The meaning of the XML will not change any more than a PNG or JPEG after their respective optimizations. Because all these changes are superficial, other programs should be able to read the documents too.
| Quote: | | I will give this a go in future. |
There are probably many ways to do it. I have a program called xml (XMLStarlet Toolkit), and this command should reduce the file size:
| Code: | | xml fo --noindent --nsclean filename.xml |
The above operation probably doesn't remove the end of line markers (which are not required). |
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Mon Nov 21, 2005 3:02 pm Post subject: |
|
|
I thought you meant editing the files, that would be too complicated!
I've had a go with an XML tool called xmllint. It also uses --nsclean but doesn't use --noindent; I think that is automatic. It is only useful on content.xml; the other files would be the same size if it weren't for the extra carriage return that it puts at the end of the file.
Using it just on content.xml (and not getting rid of the extra carriage return) and sticking it in my previous script brings the file size down to 677587 from 677642 before. |
 |
AndrewZ Moderator


Joined: 21 Jun 2004 Posts: 4148 Location: Colorado, USA
|
Posted: Mon Nov 21, 2005 3:16 pm Post subject: |
|
|
| 9point9 wrote: | | Using it just on content.xml (and not getting rid of the extra carriage return) and sticking it in my previous script brings the file size down to 677587 from 677642 before. |
Looking at content.xml, it appears to be pretty well compressed already: there is only one carriage return in the whole file, and I don't see any extra whitespace. So files produced by OpenOffice.org will not benefit from this procedure. |
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Tue Nov 22, 2005 8:15 am Post subject: |
|
|
I've made quite a few changes to the script:
| Code: | #!/bin/bash
#######################################################################
# Author: Edward Holness #
# Date: 22/11/2005 #
# ScriptName: optimOD (Optimise OpenDocument) #
# Description: A script that optimises OpenDocument (and *.sx* files,#
# with modification) losslessly. It unzips the files, #
# optimises PNG and JPEG images, then recompresses using#
# higher compression yet still being compatible with #
# OpenDocument. #
# Requirements: zip, unzip, pngcrush, jpegoptim and xmllint in your #
# path. #
# Command line instructions: sh optimod <filename> #
# Known bugs: #
#######################################################################
if test -f "$1" # Test that the file exists
then echo "$1 found" # Confirm that the file exists
else
echo "no file to optimise, exiting nicely." >&2 # Warn that the file doesn't exist and exit
exit 1
fi
unzip -oq "$1" -d "optimod_$1" || exit 1 # Extract the file
echo "$1 extracted to optimod_$1"
#### Pictures
mkdir "optimod_$1/Pictures/temp" # Make the pictures temp directory
pngcrush -q -d "optimod_$1/Pictures/temp" -brute "optimod_$1"/Pictures/*.png # Crush the PNGs
echo PNG graphics optimised
mv "optimod_$1"/Pictures/temp/*.png "optimod_$1/Pictures/" # Move the crushed PNGs on top of the originals
jpegoptim --quiet "optimod_$1"/Pictures/*.jpg # Optimise JPEGs losslessly
echo JPEG graphics optimised
rmdir "optimod_$1/Pictures/temp/" # Remove temp directory
#### Thumbnails, comment out for *.sx*
mkdir "optimod_$1/Thumbnails/temp" # Make the thumbnails temp directory
pngcrush -q -d "optimod_$1/Thumbnails/temp" -brute "optimod_$1"/Thumbnails/*.png # Crush the PNGs
echo PNG Thumbnails optimised
mv "optimod_$1"/Thumbnails/temp/*.png "optimod_$1/Thumbnails/" # Move the crushed PNGs on top of the originals
rmdir "optimod_$1/Thumbnails/temp/"
#### reduce contents, comment out for original content
rmdir "optimod_$1/Configurations2/"
rm "optimod_$1/layout-cache"
rm -rf "optimod_$1/Thumbnails/"
#### content.xml
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
sed '/^ *$/d' "optimod_$1/content.xml.lint" > "optimod_$1/content.xml" # Remove blank lines
rm "optimod_$1/content.xml.lint"
#### Compress
cd "optimod_$1/" || exit 1 # Stop rather than run zip from the wrong directory
zip -rq9 "../$1.new" *
mv "../$1.new" "../$1"
#### Clear up the junk
cd ../
rm -rf "optimod_$1/" |
1. It now uses xmllint to compact content.xml, then sed to remove the line break that xmllint inserts. Thanks to ahz for his suggestion. The other xml files don't seem to need compaction but if I could do intelligent comparison of file sizes then I would make them compactable.
2. It removes Configurations2/. This doesn't seem necessary as it's empty and is recreated on modifying the file.
3. It removes layout-cache. This is recreated on modification. It just slows down the file open process a bit.
4. Removes Thumbnails/. This is a tricky one as it's specified in OpenDocument but works OK. I always use a detailed listing in any file manager so can't see the use of thumbnails myself. It will be recreated on editing. I've left in the code for optimising Thumbnails/ just in case you don't want it removed.
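The "intelligent comparison of file sizes" mentioned in point 1 could be sketched roughly as follows. This is an assumption about one way to do it, not part of the script: `keep_if_smaller` is a hypothetical helper that replaces a file with a compacted candidate only when the candidate is non-empty and strictly smaller, and throws the candidate away otherwise.

```shell
# keep_if_smaller ORIGINAL CANDIDATE
# Replace ORIGINAL with CANDIDATE only if CANDIDATE is non-empty and smaller;
# otherwise delete CANDIDATE and leave ORIGINAL untouched.
keep_if_smaller() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    # tr strips the padding some wc implementations put around the count.
    kis_orig=$(wc -c < "$1" | tr -d ' ')
    kis_cand=$(wc -c < "$2" | tr -d ' ')
    if [ "$kis_cand" -gt 0 ] && [ "$kis_cand" -lt "$kis_orig" ]; then
        mv -- "$2" "$1"
    else
        rm -f -- "$2"
    fi
}

# Possible use for each XML file in the unpacked document:
# xmllint --nsclean "$f" > "$f.lint" && keep_if_smaller "$f" "$f.lint"
```

With a check like this, running xmllint over styles.xml, meta.xml and the rest would cost nothing when the extra trailing line makes the output larger.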
This script now reduces the OOo setup guide 1.23 from 868405 to 675327 bytes. That's 22.23%. |
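For anyone who wants to check figures like these on their own files, the percentage is simply (before - after) * 100 / before. A small sketch, where `percent_saved` is a hypothetical helper name and awk does the floating-point arithmetic:

```shell
# percent_saved BEFORE AFTER
# Print the size reduction as a percentage with two decimal places.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f\n", (before - after) * 100 / before }'
}

percent_saved 868405 675327   # prints 22.23
```

The two byte counts could come from `wc -c < file` taken before and after running the script.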
 |
AndrewZ Moderator


Joined: 21 Jun 2004 Posts: 4148 Location: Colorado, USA
|
Posted: Tue Nov 22, 2005 10:09 am Post subject: |
|
|
I do like your script: I plan to try the JPEG optimizations (nice idea) against one of my web sites. But I have two more thoughts about your script:
First, there is an option in OpenOffice.org for optimizing XML, and it seems to be on by default.
Second, the script may cause data loss in some situations, and this problem could be avoided by checking that the tools exist and that no error has occurred. For example, in the following code:
| Code: | xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
sed '/^ *$/d' "optimod_$1/content.xml.lint" > "optimod_$1/content.xml" |
Several things could happen:
1. xmllint does not exist
2. sed does not exist
3. xmllint has a runtime error
4. sed has a runtime error
The runtime errors are probably rare, but perhaps someone would run the script without some of the required tools. |
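A check for the first two cases could look something like the sketch below. It assumes a POSIX shell, where `command -v` reports whether a name resolves to something runnable; `require_tools` is a hypothetical helper name, not something from the script.

```shell
# require_tools TOOL...
# Report every tool that cannot be found on the PATH; fail if any is missing.
require_tools() {
    rt_missing=0
    for rt_tool in "$@"; do
        if ! command -v "$rt_tool" >/dev/null 2>&1; then
            printf 'missing required tool: %s\n' "$rt_tool" >&2
            rt_missing=1
        fi
    done
    return "$rt_missing"
}

# At the top of the script, before anything is unpacked or overwritten:
# require_tools zip unzip pngcrush jpegoptim xmllint sed || exit 1
```

For the runtime errors in cases 3 and 4, chaining the two commands with `&&` would at least stop sed from overwriting content.xml after xmllint has failed.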
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Tue Nov 22, 2005 1:49 pm Post subject: |
|
|
| ahz wrote: | | I plan to try the JPEG optimizations (nice idea) against one of my web sites. |
Think of all the bandwidth you could save yourself! It's amazing how many poorly compressed images there are out there. jpegoptim does seem a bit hit and miss: it will sometimes knock 15% off a JPEG but sometimes it gives no improvement. pngcrush gives more dependable results. I have not yet found a file that it failed to compact, and sometimes it saves over 40%. This is mostly because of the difference in compression methods between PNG and JPEG. It also depends on the program used to produce the image; some are better than others.
| ahz wrote: | | First, there is an option in OpenOffice.org for optimizing XML, and it seems to be on by default. |
My experiments seem to show that it's not all that well optimised. It doesn't say what it's optimised for. Either way, we cannot expect it to compact better than a dedicated program.
| ahz wrote: | | Second, the script may cause data loss in some situations, and this problem could be avoided by checking that the tools exist and that no error has occurred. |
The script is really stupid. I am assuming that people will check that they have these tools and make a backup copy before running it. Making it more intelligent is something I will do. This is my first real shell script in 7 months; my last, well, that was my first. I'm learning as I go along.
If the output file opens in OOo and looks like the original then it will almost certainly have worked. |
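The backup step could also be done by the script itself rather than left to the user. The following is a minimal sketch under that assumption: `with_backup` is a hypothetical wrapper that copies the file to FILE.bak, runs whatever command it is given, and puts the copy back if that command fails.

```shell
# with_backup FILE COMMAND [ARGS...]
# Copy FILE to FILE.bak, run COMMAND, and restore FILE from the copy
# if COMMAND exits with a non-zero status.
with_backup() {
    wb_file=$1
    shift
    cp -p -- "$wb_file" "$wb_file.bak" || return 1
    if "$@"; then
        return 0    # success: FILE.bak is left behind for the user to delete
    fi
    mv -f -- "$wb_file.bak" "$wb_file"
    return 1
}

# Possible use, with the optimiser saved as ./optimod:
# with_backup report.odt sh ./optimod report.odt
```

This only helps if the wrapped command reports failure through its exit status, so the individual steps inside the script would still need their own error checks.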
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Thu Dec 01, 2005 1:10 pm Post subject: |
|
|
I've made a whole load of minor changes which may reduce file sizes more:
- Used the '-reduce' switch with pngcrush. This losslessly reduces the bit depth of an image whose content could be expressed using a smaller palette. This will rarely be applied.
- Used the '-rem alla' pngcrush switch. This will remove all metadata from the PNG images. It is the metadata of the document that matters, not that of the image; most people don't set image metadata and there is no real use for it in a document, so I think it can be safely taken out to save space.
- Used the '--strip-all' jpegoptim switch. This also removes metadata tags to save space; again, I don't think they're that useful.
- Removed the bit about optimising thumbnails as they get deleted later on anyway. If you want them back, my older code does it.
- Got rid of the sed bit. It didn't do anything as xmllint did the same and there's still an empty line at the end.
Running this on the original test file brings it down to 674540 bytes. That's 22.32% compaction. This is only on a particular file; on one of my own I have managed to save 41.45%! I'd be keen to know if anyone can beat this!
| Code: | #!/bin/bash
#######################################################################
# Author: Edward Holness #
# Date: 01/12/2005 #
# ScriptName: optimOD (Optimise OpenDocument) #
# Description: A script that optimises OpenDocument (and *.sx* files,#
# with modification) losslessly. It unzips the files, #
# optimises PNG and JPEG images, removes unnecessary #
# contents, then recompresses using higher compression #
# yet still being compatible with OpenDocument. #
# Requirements: zip, unzip, pngcrush, jpegoptim and xmllint in your #
# path. #
# Command line instructions: sh optimod <filename> or add to your path#
# Known bugs: #
#######################################################################
if test -f "$1" # Test that the file exists
then echo "$1 found" # Confirm that the file exists
else
echo "no file to optimise, exiting nicely." >&2 # Warn that the file doesn't exist and exit
exit 1
fi
unzip -oq "$1" -d "optimod_$1" || exit 1 # Extract the file
echo "$1 extracted to optimod_$1"
#### Pictures
mkdir "optimod_$1/Pictures/temp" # Make the pictures temp directory
pngcrush -q -d "optimod_$1/Pictures/temp" -brute -reduce -rem alla "optimod_$1"/Pictures/*.png # Crush the PNGs
echo PNG graphics optimised
mv "optimod_$1"/Pictures/temp/*.png "optimod_$1/Pictures/" # Move the crushed PNGs on top of the originals
jpegoptim --strip-all --quiet "optimod_$1"/Pictures/*.jpg # Optimise JPEGs losslessly
echo JPEG graphics optimised
rmdir "optimod_$1/Pictures/temp/" # Remove temp directory
#### reduce contents, comment out for original content
rmdir "optimod_$1/Configurations2/"
echo Removed Configurations2/
rm "optimod_$1/layout-cache"
echo Removed layout-cache
rm -rf "optimod_$1/Thumbnails/"
echo Removed Thumbnails/
#### content.xml
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
mv "optimod_$1/content.xml.lint" "optimod_$1/content.xml"
echo Compacted content.xml
#### Compress
cd "optimod_$1/" || exit 1 # Stop rather than run zip from the wrong directory
zip -rq9 "../$1.new" *
mv "../$1.new" "../$1"
echo Compressed
#### Clear up the junk
cd ../
rm -rf "optimod_$1/" |
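One case the script above does not cover is a document with no PNG or no JPEG images. When a pattern such as Pictures/*.png matches nothing, the shell normally passes the pattern through unchanged, so the image tool is handed a file name that does not exist. A possible guard is sketched below; `has_match` is a hypothetical helper name.

```shell
# has_match FILE...
# Succeed if at least one argument names an existing file. An unmatched glob
# arrives as the literal pattern, which does not exist, so the check fails.
has_match() {
    for hm_file in "$@"; do
        [ -e "$hm_file" ] && return 0
    done
    return 1
}

# Possible use inside the script:
# if has_match "optimod_$1"/Pictures/*.jpg; then
#     jpegoptim --strip-all --quiet "optimod_$1"/Pictures/*.jpg
# fi
```

The same guard would apply to the mv of the crushed PNGs, which otherwise fails in the same way when there are none.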
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Sun Mar 12, 2006 6:44 am Post subject: |
|
|
I've changed the script to use OptiPNG instead of pngcrush. It currently uses 1080 runs. This is more than is recommended but gives a result 3 KB smaller than with only 240 runs. It now takes a lot longer but does get the files smaller.
I hope to incorporate brute-forcing of the zlib window size. That should make it about 14 times slower, but there has got to be a reason to have an Athlon 64 FX-60.
OptiPNG is better than pngcrush so you could always reduce the number of runs and end up with something faster.
This was done with OptiPNG 0.5, which doesn't seem to include everything that pngcrush does yet. I would expect it to get better with later versions.
This can reduce the setup guide draft 1 used above to 669518 bytes. That's 22.9%.
| Code: | #!/bin/bash
#######################################################################
# Author: Edward Holness #
# Date: 2006-03-12 #
# ScriptName: optimOD (Optimise OpenDocument) #
# Description: A script that optimises OpenDocument (and *.sx* files,#
# with modification) losslessly. It unzips the files, #
# optimises PNG and JPEG images, removes unnecessary #
# contents, then recompresses using higher compression #
# yet still being compatible with OpenDocument. #
# Requirements: zip, unzip, optipng, jpegoptim and xmllint in your #
# path. #
# Command line instructions: sh optimod <filename> or add to your path#
# Known bugs: #
#######################################################################
if test -f "$1" # Test that the file exists
then echo "$1 found" # Confirm that the file exists
else
echo "no file to optimise, exiting nicely." >&2 # Warn that the file doesn't exist and exit
exit 1
fi
unzip -oq "$1" -d "optimod_$1" || exit 1 # Extract the file
echo "$1 extracted to optimod_$1"
#### Pictures
optipng -zc1-9 -zm1-9 -zs0-3 -f0-5 "optimod_$1"/Pictures/*.png
echo PNG graphics optimised
jpegoptim --strip-all --quiet "optimod_$1"/Pictures/*.jpg # Optimise JPEGs losslessly
echo JPEG graphics optimised
#### reduce contents, comment out for original content
rmdir "optimod_$1/Configurations2/"
echo Removed Configurations2/
rm "optimod_$1/layout-cache"
echo Removed layout-cache
rm -rf "optimod_$1/Thumbnails/"
echo Removed Thumbnails/
#### content.xml
xmllint --nsclean "optimod_$1/content.xml" > "optimod_$1/content.xml.lint"
mv "optimod_$1/content.xml.lint" "optimod_$1/content.xml"
echo Compacted content.xml
#### Compress
cd "optimod_$1/" || exit 1 # Stop rather than run zip from the wrong directory
zip -rq9 "../$1.new" *
mv "../$1.new" "../$1"
echo Compressed
#### Clear up the junk
cd ../
rm -rf "optimod_$1/" |
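A note on the Compress step for anyone worried about breaking ODF: the OpenDocument packaging rules ask for the mimetype file to be the first entry in the ZIP and to be stored uncompressed, which a plain `zip -rq9 ... *` into a new archive does not guarantee. The sketch below shows one commonly used way to build such an archive with Info-ZIP's zip; `repack_odf` is a hypothetical helper name, and I have not checked how strictly individual OpenDocument programs enforce the rule.

```shell
# repack_odf SRC_DIR OUT_FILE
# Zip the unpacked document in SRC_DIR into OUT_FILE with "mimetype" first
# and stored (-0), then add everything else at maximum compression (-9).
# -X leaves out extra file attributes.
repack_odf() {
    # Make OUT_FILE absolute, because zip is run from inside SRC_DIR.
    ro_out=$(cd "$(dirname "$2")" && pwd)/$(basename "$2") || return 1
    (
        cd "$1" || exit 1
        rm -f "$ro_out"
        zip -X0q "$ro_out" mimetype &&
        zip -Xrq9 "$ro_out" * -x mimetype
    )
}

# Possible replacement for the zip line in the script:
# repack_odf "optimod_$1" "$1.new" && mv "$1.new" "$1"
```

Removing Thumbnails/ or layout-cache also leaves their entries in META-INF/manifest.xml, which this sketch does not touch.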
 |
9point9 Moderator

Joined: 31 Aug 2004 Posts: 3904 Location: UK
|
Posted: Sun Mar 12, 2006 1:43 pm Post subject: |
|
|
Implementing AdvanceCOMP gives some improvement:
| Code: | advpng -z4 optimod_$1/Pictures/*.png
advzip ../$1 -z4 |
The setup guide file now comes to 661082 bytes.
Low-level modification could also be used to repair corrupt OpenDocument files. I'm thinking about this now and how to implement it. |