rotomano OOo Enthusiast


Joined: 13 Dec 2006 Posts: 198 Location: Greece
Posted: Wed Mar 21, 2007 7:07 am Post subject: PERL: Convert .swx files coded in windows-1253 to unicode
I need to convert some old .swx files created in StarOffice 5.2, which was non-Unicode; the text in them uses the windows-1253 charset. I need to convert the windows-1253 text to Unicode so that I can use it in newer versions of OOo, which support Unicode.
I found a Perl script that does this kind of text conversion, and I was wondering if anyone could guide me on how to use the following script in OOo to convert the text of text files. How do I convert the following script into OO Basic?
Many thanks for your time!
#!/usr/bin/perl
# win2uni.pl -- Version 0.01 98.02.22 02:57
# Converts Windows-1253 to UTF-8
# Encodings: http://premium.microsoft.com/msdn/library/books/techlang/devintl/d3/s2574.htm
# Algorithm: http://www.talisman.org/utf8.html
# (c) 1998 Hellenic Resources Institute, Inc.
while(<>) { # All others stay as is. This does not convert unknowns to 0xFFFE.
s/\xCE/\xCE\x9E/g; # GREEK CAPITAL LETTER XI
s/\xC2/\xCE\x92/g; # GREEK CAPITAL LETTER BETA
s/\xB2/\xC2\xB2/g; # SUPERSCRIPT TWO
s/\xE2/\xCE\xB2/g; # GREEK SMALL LETTER BETA
s/\xA0/\xC2\xA0/g; # NO-BREAK SPACE
s/\x86/\xE2\x80\xA0/g; # DAGGER
s/\xA2/\xCE\x86/g; # GREEK CAPITAL LETTER ALPHA WITH TONOS
s/\x84/\xE2\x80\x9E/g; # DOUBLE LOW-9 QUOTATION MARK
s/\x99/\xE2\x84\xA2/g; # TRADE MARK SIGN
s/([^\xCE])\x92/$1\xE2\x80\x99/g; # RIGHT SINGLE QUOTATION MARK - Yes, I hate this one... Any better ideas?
s/\xB6/\xC2\xB6/g; # PILCROW SIGN
s/\x93/\xE2\x80\x9C/g; # LEFT DOUBLE QUOTATION MARK
s/\xC3/\xCE\x93/g; # GREEK CAPITAL LETTER GAMMA
s/\xCD/\xCE\x9D/g; # GREEK CAPITAL LETTER NU
s/\xCF/\xCE\x9F/g; # GREEK CAPITAL LETTER OMICRON
s/\xA6/\xC2\xA6/g; # BROKEN BAR
s/\x85/\xE2\x80\xA6/g; # HORIZONTAL ELLIPSIS
s/\xA1/\xCE\x85/g; # GREEK DIALYTIKA TONOS
s/\xA3/\xC2\xA3/g; # POUND SIGN
s/\xA4/\xC2\xA4/g; # CURRENCY SIGN
s/\xA5/\xC2\xA5/g; # YEN SIGN
s/\xB0/\xC2\xB0/g; # DEGREE SIGN
s/\xBA/\xCE\x8A/g; # GREEK CAPITAL LETTER IOTA WITH TONOS
s/\x87/\xE2\x80\xA1/g; # DOUBLE DAGGER
s/\x89/\xE2\x80\xB0/g; # PER MILLE SIGN
s/\xB9/\xCE\x89/g; # GREEK CAPITAL LETTER ETA WITH TONOS
s/\x8B/\xE2\x80\xB9/g; # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
s/\x9B/\xE2\x80\xBA/g; # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
s/\xB5/\xC2\xB5/g; # MICRO SIGN
s/\x91/\xE2\x80\x98/g; # LEFT SINGLE QUOTATION MARK
s/\xB3/\xC2\xB3/g; # SUPERSCRIPT THREE
s/\x94/\xE2\x80\x9D/g; # RIGHT DOUBLE QUOTATION MARK
s/\x95/\xE2\x80\xA2/g; # BULLET
s/\x96/\xE2\x80\x93/g; # EN DASH
s/\x97/\xE2\x80\x94/g; # EM DASH
s/\xA7/\xC2\xA7/g; # SECTION SIGN
s/\xA8/\xC2\xA8/g; # DIAERESIS
s/\xA9/\xC2\xA9/g; # COPYRIGHT SIGN
s/\xAA/\xC2\xAA/g; # FEMININE ORDINAL INDICATOR
s/\xAB/\xC2\xAB/g; # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
s/\xAC/\xC2\xAC/g; # NOT SIGN
s/\xAD/\xC2\xAD/g; # SOFT HYPHEN
s/\xAE/\xC2\xAE/g; # REGISTERED SIGN
s/\xAF/\xE2\x80\x95/g; # HORIZONTAL BAR (U+2015, not the Latin-1 macron)
s/\xB1/\xC2\xB1/g; # PLUS-MINUS SIGN
s/\xB4/\xCE\x84/g; # GREEK TONOS
s/\xB7/\xC2\xB7/g; # MIDDLE DOT
s/\xB8/\xCE\x88/g; # GREEK CAPITAL LETTER EPSILON WITH TONOS
s/\xBB/\xC2\xBB/g; # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
s/\xBC/\xCE\x8C/g; # GREEK CAPITAL LETTER OMICRON WITH TONOS
s/\xBD/\xC2\xBD/g; # VULGAR FRACTION ONE HALF
s/\xBE/\xCE\x8E/g; # GREEK CAPITAL LETTER UPSILON WITH TONOS
s/\xBF/\xCE\x8F/g; # GREEK CAPITAL LETTER OMEGA WITH TONOS
s/\xC0/\xCE\x90/g; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
s/\xC1/\xCE\x91/g; # GREEK CAPITAL LETTER ALPHA
s/\xC4/\xCE\x94/g; # GREEK CAPITAL LETTER DELTA
s/\xC5/\xCE\x95/g; # GREEK CAPITAL LETTER EPSILON
s/\xC6/\xCE\x96/g; # GREEK CAPITAL LETTER ZETA
s/\xC7/\xCE\x97/g; # GREEK CAPITAL LETTER ETA
s/\xC8/\xCE\x98/g; # GREEK CAPITAL LETTER THETA
s/\xC9/\xCE\x99/g; # GREEK CAPITAL LETTER IOTA
s/\xCA/\xCE\x9A/g; # GREEK CAPITAL LETTER KAPPA
s/\xCB/\xCE\x9B/g; # GREEK CAPITAL LETTER LAMDA
s/\xCC/\xCE\x9C/g; # GREEK CAPITAL LETTER MU
s/\xD0/\xCE\xA0/g; # GREEK CAPITAL LETTER PI
s/\xD1/\xCE\xA1/g; # GREEK CAPITAL LETTER RHO
s/\xD3/\xCE\xA3/g; # GREEK CAPITAL LETTER SIGMA
s/\xD4/\xCE\xA4/g; # GREEK CAPITAL LETTER TAU
s/\xD5/\xCE\xA5/g; # GREEK CAPITAL LETTER UPSILON
s/\xD6/\xCE\xA6/g; # GREEK CAPITAL LETTER PHI
s/\xD7/\xCE\xA7/g; # GREEK CAPITAL LETTER CHI
s/\xD8/\xCE\xA8/g; # GREEK CAPITAL LETTER PSI
s/\xD9/\xCE\xA9/g; # GREEK CAPITAL LETTER OMEGA
s/\xDA/\xCE\xAA/g; # GREEK CAPITAL LETTER IOTA WITH DIALYTIKA
s/\xDB/\xCE\xAB/g; # GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA
s/\xDC/\xCE\xAC/g; # GREEK SMALL LETTER ALPHA WITH TONOS
s/\xDD/\xCE\xAD/g; # GREEK SMALL LETTER EPSILON WITH TONOS
s/\xDE/\xCE\xAE/g; # GREEK SMALL LETTER ETA WITH TONOS
s/\xDF/\xCE\xAF/g; # GREEK SMALL LETTER IOTA WITH TONOS
s/\xE0/\xCE\xB0/g; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
s/\xE1/\xCE\xB1/g; # GREEK SMALL LETTER ALPHA
s/\xE3/\xCE\xB3/g; # GREEK SMALL LETTER GAMMA
s/\xE4/\xCE\xB4/g; # GREEK SMALL LETTER DELTA
s/\xE5/\xCE\xB5/g; # GREEK SMALL LETTER EPSILON
s/\xE6/\xCE\xB6/g; # GREEK SMALL LETTER ZETA
s/\xE7/\xCE\xB7/g; # GREEK SMALL LETTER ETA
s/\xE8/\xCE\xB8/g; # GREEK SMALL LETTER THETA
s/\xE9/\xCE\xB9/g; # GREEK SMALL LETTER IOTA
s/\xEA/\xCE\xBA/g; # GREEK SMALL LETTER KAPPA
s/\xEB/\xCE\xBB/g; # GREEK SMALL LETTER LAMDA
s/\xEC/\xCE\xBC/g; # GREEK SMALL LETTER MU
s/\xED/\xCE\xBD/g; # GREEK SMALL LETTER NU
s/\xEE/\xCE\xBE/g; # GREEK SMALL LETTER XI
s/\xEF/\xCE\xBF/g; # GREEK SMALL LETTER OMICRON
# The \x82 and \x83 rules must run before the block below, whose output contains \xCF\x82 and \xCF\x83.
s/\x82/\xE2\x80\x9A/g; # SINGLE LOW-9 QUOTATION MARK
s/\x83/\xC6\x92/g; # LATIN SMALL LETTER F WITH HOOK
# U+03C0..U+03CE encode as \xCF\x80..\xCF\x8E; a second byte of \xC0 or above is not valid UTF-8.
s/\xF0/\xCF\x80/g; # GREEK SMALL LETTER PI
s/\xF1/\xCF\x81/g; # GREEK SMALL LETTER RHO
s/\xF2/\xCF\x82/g; # GREEK SMALL LETTER FINAL SIGMA
s/\xF3/\xCF\x83/g; # GREEK SMALL LETTER SIGMA
s/\xF4/\xCF\x84/g; # GREEK SMALL LETTER TAU
s/\xF5/\xCF\x85/g; # GREEK SMALL LETTER UPSILON
s/\xF6/\xCF\x86/g; # GREEK SMALL LETTER PHI
s/\xF7/\xCF\x87/g; # GREEK SMALL LETTER CHI
s/\xF8/\xCF\x88/g; # GREEK SMALL LETTER PSI
s/\xF9/\xCF\x89/g; # GREEK SMALL LETTER OMEGA
s/\xFA/\xCF\x8A/g; # GREEK SMALL LETTER IOTA WITH DIALYTIKA
s/\xFB/\xCF\x8B/g; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA
s/\xFC/\xCF\x8C/g; # GREEK SMALL LETTER OMICRON WITH TONOS
s/\xFD/\xCF\x8D/g; # GREEK SMALL LETTER UPSILON WITH TONOS
s/\xFE/\xCF\x8E/g; # GREEK SMALL LETTER OMEGA WITH TONOS
print;
}
Robert Tucker Moderator


Joined: 16 Aug 2004 Posts: 3367 Location: Manchester UK
Posted: Wed Mar 21, 2007 7:15 am Post subject:
As far as plain text files are concerned, iconv will convert between many, many encodings, as will a number of plain text editors.
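A minimal sketch of the iconv route, assuming a plain text file and an iconv build that knows the name WINDOWS-1253 (`iconv -l` lists the names a given installation accepts). The file names and the temporary directory are hypothetical, made up only for illustration. It applies to plain text only, not to a binary word-processor document.

```shell
#!/bin/sh
# Sketch only: file names and the temp directory are illustrative assumptions.
tmpdir=$(mktemp -d)

# Write the Windows-1253 bytes 0xE1 0xE2 0xE3 (alpha, beta, gamma) plus a newline.
printf '\341\342\343\n' > "$tmpdir/sample-cp1253.txt"

# -f names the source encoding, -t the target encoding.
iconv -f WINDOWS-1253 -t UTF-8 "$tmpdir/sample-cp1253.txt" > "$tmpdir/sample-utf8.txt"

# Dump the converted file as hex bytes to check the result.
# Expected bytes: ce b1 ce b2 ce b3 0a
od -An -tx1 "$tmpdir/sample-utf8.txt"
```

Unlike a chain of byte substitutions, iconv decodes each input byte exactly once, so the output of one mapping cannot be picked up and converted again by another.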