OpenOffice.org Forum at OOoForum.org

PERL: Convert .swx files coded in windows-1253 to unicode

 
rotomano
OOo Enthusiast
Joined: 13 Dec 2006
Posts: 198
Location: Greece

Posted: Wed Mar 21, 2007 7:07 am    Post subject: PERL: Convert .swx files coded in windows-1253 to unicode

I need to convert some old .swx files created in StarOffice 5.2, which was non-Unicode; the text in them uses the windows-1253 charset. I need to convert the windows-1253 text to Unicode so that I can use it in newer versions of OOo, which support Unicode.

I found a Perl script that does the text conversion, and I was wondering whether anyone could guide me on how to use the following script in OOo to convert the text of text files. How do I convert the following script into OOo Basic?


Many thanks for your time!

#!/usr/bin/perl
# win2uni.pl -- Version 0.01 98.02.22 02:57
# Converts Windows-1253 to UTF-8
# Encodings: http://premium.microsoft.com/msdn/library/books/techlang/devintl/d3/s2574.htm
# Algorithm: http://www.talisman.org/utf8.html
# (c) 1998 Hellenic Resources Institute, Inc.

while(<>) { # All others stay as is. This does not convert unknowns to 0xFFFE.
s/\xCE/\xCE\x9E/g; # GREEK CAPITAL LETTER XI

s/\xC2/\xCE\x92/g; # GREEK CAPITAL LETTER BETA
s/\xB2/\xC2\xB2/g; # SUPERSCRIPT TWO
s/\xE2/\xCE\xB2/g; # GREEK SMALL LETTER BETA

s/\xA0/\xC2\xA0/g; # NO-BREAK SPACE
s/\x86/\xE2\x80\xA0/g; # DAGGER
s/\xA2/\xCE\x86/g; # GREEK CAPITAL LETTER ALPHA WITH TONOS
s/\x84/\xE2\x80\x9E/g; # DOUBLE LOW-9 QUOTATION MARK
s/\x99/\xE2\x84\xA2/g; # TRADE MARK SIGN
s/(?<!\xCE)\x92/\xE2\x80\x99/g; # RIGHT SINGLE QUOTATION MARK - must skip the \xCE\x92 emitted for BETA above; the lookbehind also handles \x92 at line start. Yes, I hate this one... Any better ideas?

s/\xB6/\xC2\xB6/g; # PILCROW SIGN
s/\x93/\xE2\x80\x9C/g; # LEFT DOUBLE QUOTATION MARK
s/\xC3/\xCE\x93/g; # GREEK CAPITAL LETTER GAMMA
s/\xCD/\xCE\x9D/g; # GREEK CAPITAL LETTER NU
s/\xCF/\xCE\x9F/g; # GREEK CAPITAL LETTER OMICRON
s/\xA6/\xC2\xA6/g; # BROKEN BAR
s/\x85/\xE2\x80\xA6/g; # HORIZONTAL ELLIPSIS
s/\xA1/\xCE\x85/g; # GREEK DIALYTIKA TONOS
s/\xA3/\xC2\xA3/g; # POUND SIGN
s/\xA4/\xC2\xA4/g; # CURRENCY SIGN
s/\xA5/\xC2\xA5/g; # YEN SIGN
s/\xB0/\xC2\xB0/g; # DEGREE SIGN
s/\xBA/\xCE\x8A/g; # GREEK CAPITAL LETTER IOTA WITH TONOS
s/\x87/\xE2\x80\xA1/g; # DOUBLE DAGGER
s/\x89/\xE2\x80\xB0/g; # PER MILLE SIGN
s/\xB9/\xCE\x89/g; # GREEK CAPITAL LETTER ETA WITH TONOS
s/\x8B/\xE2\x80\xB9/g; # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
s/\x9B/\xE2\x80\xBA/g; # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
s/\xB5/\xC2\xB5/g; # MICRO SIGN
s/\x91/\xE2\x80\x98/g; # LEFT SINGLE QUOTATION MARK
s/\xB3/\xC2\xB3/g; # SUPERSCRIPT THREE
s/\x94/\xE2\x80\x9D/g; # RIGHT DOUBLE QUOTATION MARK
s/\x95/\xE2\x80\xA2/g; # BULLET
s/\x96/\xE2\x80\x93/g; # EN DASH
s/\x97/\xE2\x80\x94/g; # EM DASH

s/\xA7/\xC2\xA7/g; # SECTION SIGN
s/\xA8/\xC2\xA8/g; # DIAERESIS
s/\xA9/\xC2\xA9/g; # COPYRIGHT SIGN
s/\xAA/\xC2\xAA/g; # FEMININE ORDINAL INDICATOR
s/\xAB/\xC2\xAB/g; # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
s/\xAC/\xC2\xAC/g; # NOT SIGN
s/\xAD/\xC2\xAD/g; # SOFT HYPHEN
s/\xAE/\xC2\xAE/g; # REGISTERED SIGN
s/\xAF/\xE2\x80\x95/g; # HORIZONTAL BAR (U+2015; \xC2\xAF would be MACRON)
s/\xB1/\xC2\xB1/g; # PLUS-MINUS SIGN
s/\xB4/\xCE\x84/g; # GREEK TONOS
s/\xB7/\xC2\xB7/g; # MIDDLE DOT
s/\xB8/\xCE\x88/g; # GREEK CAPITAL LETTER EPSILON WITH TONOS
s/\xBB/\xC2\xBB/g; # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
s/\xBC/\xCE\x8C/g; # GREEK CAPITAL LETTER OMICRON WITH TONOS
s/\xBD/\xC2\xBD/g; # VULGAR FRACTION ONE HALF
s/\xBE/\xCE\x8E/g; # GREEK CAPITAL LETTER UPSILON WITH TONOS
s/\xBF/\xCE\x8F/g; # GREEK CAPITAL LETTER OMEGA WITH TONOS
s/\xC0/\xCE\x90/g; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
s/\xC1/\xCE\x91/g; # GREEK CAPITAL LETTER ALPHA
s/\xC4/\xCE\x94/g; # GREEK CAPITAL LETTER DELTA
s/\xC5/\xCE\x95/g; # GREEK CAPITAL LETTER EPSILON
s/\xC6/\xCE\x96/g; # GREEK CAPITAL LETTER ZETA
s/\xC7/\xCE\x97/g; # GREEK CAPITAL LETTER ETA
s/\xC8/\xCE\x98/g; # GREEK CAPITAL LETTER THETA
s/\xC9/\xCE\x99/g; # GREEK CAPITAL LETTER IOTA
s/\xCA/\xCE\x9A/g; # GREEK CAPITAL LETTER KAPPA
s/\xCB/\xCE\x9B/g; # GREEK CAPITAL LETTER LAMDA
s/\xCC/\xCE\x9C/g; # GREEK CAPITAL LETTER MU
s/\xD0/\xCE\xA0/g; # GREEK CAPITAL LETTER PI
s/\xD1/\xCE\xA1/g; # GREEK CAPITAL LETTER RHO
s/\xD3/\xCE\xA3/g; # GREEK CAPITAL LETTER SIGMA
s/\xD4/\xCE\xA4/g; # GREEK CAPITAL LETTER TAU
s/\xD5/\xCE\xA5/g; # GREEK CAPITAL LETTER UPSILON
s/\xD6/\xCE\xA6/g; # GREEK CAPITAL LETTER PHI
s/\xD7/\xCE\xA7/g; # GREEK CAPITAL LETTER CHI
s/\xD8/\xCE\xA8/g; # GREEK CAPITAL LETTER PSI
s/\xD9/\xCE\xA9/g; # GREEK CAPITAL LETTER OMEGA
s/\xDA/\xCE\xAA/g; # GREEK CAPITAL LETTER IOTA WITH DIALYTIKA
s/\xDB/\xCE\xAB/g; # GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA
s/\xDC/\xCE\xAC/g; # GREEK SMALL LETTER ALPHA WITH TONOS
s/\xDD/\xCE\xAD/g; # GREEK SMALL LETTER EPSILON WITH TONOS
s/\xDE/\xCE\xAE/g; # GREEK SMALL LETTER ETA WITH TONOS
s/\xDF/\xCE\xAF/g; # GREEK SMALL LETTER IOTA WITH TONOS
s/\xE0/\xCE\xB0/g; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
s/\xE1/\xCE\xB1/g; # GREEK SMALL LETTER ALPHA
s/\xE3/\xCE\xB3/g; # GREEK SMALL LETTER GAMMA
s/\xE4/\xCE\xB4/g; # GREEK SMALL LETTER DELTA
s/\xE5/\xCE\xB5/g; # GREEK SMALL LETTER EPSILON
s/\xE6/\xCE\xB6/g; # GREEK SMALL LETTER ZETA
s/\xE7/\xCE\xB7/g; # GREEK SMALL LETTER ETA
s/\xE8/\xCE\xB8/g; # GREEK SMALL LETTER THETA
s/\xE9/\xCE\xB9/g; # GREEK SMALL LETTER IOTA
s/\xEA/\xCE\xBA/g; # GREEK SMALL LETTER KAPPA
s/\xEB/\xCE\xBB/g; # GREEK SMALL LETTER LAMDA
s/\xEC/\xCE\xBC/g; # GREEK SMALL LETTER MU
s/\xED/\xCE\xBD/g; # GREEK SMALL LETTER NU
s/\xEE/\xCE\xBE/g; # GREEK SMALL LETTER XI
s/\xEF/\xCE\xBF/g; # GREEK SMALL LETTER OMICRON
# \x82 and \x83 are replaced first so that they cannot hit the \xCF\x82 and \xCF\x83 emitted below
s/\x82/\xE2\x80\x9A/g; # SINGLE LOW-9 QUOTATION MARK
s/\x83/\xC6\x92/g; # LATIN SMALL LETTER F WITH HOOK
# U+03C0..U+03CE encode as \xCF\x80..\xCF\x8E (the second byte of a UTF-8 pair is always \x80-\xBF)
s/\xF0/\xCF\x80/g; # GREEK SMALL LETTER PI
s/\xF1/\xCF\x81/g; # GREEK SMALL LETTER RHO
s/\xF2/\xCF\x82/g; # GREEK SMALL LETTER FINAL SIGMA
s/\xF3/\xCF\x83/g; # GREEK SMALL LETTER SIGMA
s/\xF4/\xCF\x84/g; # GREEK SMALL LETTER TAU
s/\xF5/\xCF\x85/g; # GREEK SMALL LETTER UPSILON
s/\xF6/\xCF\x86/g; # GREEK SMALL LETTER PHI
s/\xF7/\xCF\x87/g; # GREEK SMALL LETTER CHI
s/\xF8/\xCF\x88/g; # GREEK SMALL LETTER PSI
s/\xF9/\xCF\x89/g; # GREEK SMALL LETTER OMEGA
s/\xFA/\xCF\x8A/g; # GREEK SMALL LETTER IOTA WITH DIALYTIKA
s/\xFB/\xCF\x8B/g; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA
s/\xFC/\xCF\x8C/g; # GREEK SMALL LETTER OMICRON WITH TONOS
s/\xFD/\xCF\x8D/g; # GREEK SMALL LETTER UPSILON WITH TONOS
s/\xFE/\xCF\x8E/g; # GREEK SMALL LETTER OMEGA WITH TONOS

print;
}
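
For comparison, here is a minimal sketch of the same Windows-1253 to UTF-8 conversion done through a codec table instead of ordered byte substitutions. It is written in Python rather than Perl or OOo Basic, and the function name win1253_to_utf8 is made up for illustration. It assumes the input is raw windows-1253 text, and it replaces bytes that the code page leaves undefined with U+FFFD, which the script above does not do.

```python
def win1253_to_utf8(raw: bytes) -> bytes:
    """Decode Windows-1253 bytes and re-encode them as UTF-8.

    Hypothetical helper for illustration only.
    """
    # Decoding goes through one lookup table, so the order of the
    # mappings cannot matter the way it does with chained s///g calls.
    # errors="replace" turns undefined bytes into U+FFFD.
    text = raw.decode("cp1253", errors="replace")
    return text.encode("utf-8")


# 0xF0 = pi, 0xE1 = alpha, 0x92 = right single quotation mark
sample = bytes([0xF0, 0xE1, 0xF0, 0x92])
converted = win1253_to_utf8(sample)
print(converted.decode("utf-8"))
```

Because every input byte is looked up exactly once, there is no need for special handling of \x92 after \xCE, and a byte such as \xF0 comes out as the two-byte sequence \xCF\x80.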
Robert Tucker
Moderator
Joined: 16 Aug 2004
Posts: 3407
Location: Manchester UK

Posted: Wed Mar 21, 2007 7:15 am

As far as plain text files are concerned, iconv will convert between many, many encodings, as will a number of plain text editors.
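
As a rough illustration of that kind of whole-file conversion, the sketch below rewrites a plain text file from windows-1253 to UTF-8 in Python. The function name convert_text_file and the file names are invented for the example, and it only applies to plain text, not to binary StarOffice documents. With iconv itself, the equivalent would be along the lines of `iconv -f WINDOWS-1253 -t UTF-8 in.txt > out.txt`.

```python
import tempfile
from pathlib import Path


def convert_text_file(src: Path, dst: Path, source_encoding: str = "cp1253") -> None:
    """Rewrite a plain text file from source_encoding to UTF-8.

    Hypothetical helper; it fails with UnicodeDecodeError on undefined bytes.
    """
    # Work on bytes so that line endings pass through unchanged.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Usage on a throwaway file (the file names are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "greek-1253.txt"
    dst = Path(tmp) / "greek-utf8.txt"
    # The Greek word "kalimera" encoded in windows-1253, with a CRLF line ending
    src.write_bytes(b"\xea\xe1\xeb\xe7\xec\xdd\xf1\xe1\r\n")
    convert_text_file(src, dst)
    result = dst.read_bytes()
    print(result.decode("utf-8"))
```

Writing to a separate destination file keeps the original intact, which is useful if the source encoding turns out to be something other than windows-1253.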
All times are GMT - 8 Hours