OpenOffice.org Forum at OOoForum.org
 

How to delete hyphens from an OCR'd document?

 
OOoForum.org Forum Index -> OpenOffice.org Writer
Harvester
General User

Joined: 17 Dec 2004
Posts: 7

Posted: Fri Dec 17, 2004 1:24 pm    Post subject: How to delete hyphens from an OCR'd document?

I have an OCR'd text which contains many hyphens and I would like OpenOffice to delete them automatically.

For example, I have some text like "antici-pated" and I would like to have "anticipated", of course. :)

I think OOo must be able to do this, because all hyphens are highlighted with a grey box. But I don't know how to do this.
dpeach
OOo Advocate

Joined: 06 Oct 2003
Posts: 397
Location: Mérida, Yucatán, México

Posted: Fri Dec 17, 2004 2:48 pm

I was going to say to do a find and replace on them until you said it had a gray box around them. This means that OOo has probably made them some kind of "AutoNumbering" or some such.

You don't say whether all of these hyphens were in the original document, or if OOo inserted them. Also, do all the hyphens happen to be at the beginning of the line?

I would first try turning Autonumbering on and off. If that does not do it, try going to Tools | AutoFormat/AutoCorrect and turning off the formatting tools, maybe one by one so you can see which one affects it.
_________________
dpeach
OOo 2.0.4 *** Slackware 11

www.mythoughtspot.com <-- My Blog *** My Podcast --> www.missionarytalks.com
JohnV
Administrator

Joined: 07 Mar 2003
Posts: 9183
Location: Lexington, Kentucky, USA

Posted: Fri Dec 17, 2004 6:11 pm

Were the hyphens that ended up in the middle of a line originally at the right margin of the doc you OCRed?

When you OCR text into OO and toggle Ctrl+F10 to see formatting characters, do you find a paragraph break at the end of each line?

Do you find a series of spaces before each line?
Harvester
General User

Joined: 17 Dec 2004
Posts: 7

Posted: Sat Dec 18, 2004 3:00 am

The "-" were in the original document at the end of the line and indicated that a word was "wrapped" to the next line. (Or so I guess, because I didn't OCR the text myself.)

I don't see anything special with Ctrl+F10.

I can turn them off with the following procedure:

options-> text document -> format aids -> user defined hyphens: (yes/no)

(It may be worded differently in the English version of OOo, though... I merely translated the terms from the German version back to English.)

This is fine, but I want to delete (search and replace by nothing?) them completely from the text. If I do what I wrote above, I just don't see them. If I save the document afterwards, the hyphens are saved in the new document, too.

edit:
I tried to search for and replace the "-" sign, but OOo is unable to find them.
There seem to be three different "-" signs:

- a very short one with a grey box around it, which OOo can't find when I search for "-". These are the ones I want to delete, because they appear inside words like "shad-ow", "fea-ture", etc.
- a normal-sized one, which is found when I search for "-". These don't have grey boxes around them and appear in compound words like "hand-picked", "tooth-like", etc.
- very long ones, which mark parenthetical insertions in the text -- like this one here -- etc.
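
The three kinds of "-" in the list above are most likely three different characters rather than one character drawn at three sizes. Here is a minimal Python sketch for checking that in text copied or exported from the document. It assumes the grey-boxed custom hyphen is stored as U+00AD SOFT HYPHEN, which the thread does not confirm, and the function name `classify_dashes` is made up for illustration.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # assumption: this is what the grey-boxed custom hyphen is


def classify_dashes(text):
    """Count hyphen-like characters in text, keyed by 'U+XXXX NAME'."""
    counts = {}
    for ch in text:
        # "Pd" is the Unicode category for dash punctuation (hyphen-minus,
        # en dash, em dash, ...). The soft hyphen is a format character,
        # so it has to be checked for separately.
        if unicodedata.category(ch) == "Pd" or ch == SOFT_HYPHEN:
            key = f"U+{ord(ch):04X} {unicodedata.name(ch)}"
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Sample text with one soft hyphen, one normal hyphen and one em dash.
    sample = "shad\u00adow, hand-picked, an aside \u2014 like this"
    for key, n in sorted(classify_dashes(sample).items()):
        print(n, key)
```

If the short hyphens really are soft hyphens, that would also explain why a search for a plain "-" never matches them.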
Robert Tucker
Moderator

Joined: 16 Aug 2004
Posts: 3407
Location: Manchester UK

Posted: Sat Dec 18, 2004 5:28 am

Check the help files under 'field shadings' and 'custom hyphens'.

On the English version of OpenOffice there is a “check box” at:

Tools>Options>Text Document>Formatting Aids>Custom hyphens

but that looks rather like what you have already described.

If what you describe as OCR is in fact the extraction of text from a PDF document, you may find it easier, as far as the hyphens are concerned, to use a “pdf to text” tool rather than a “pdf to Word or .doc” tool. Alternatively, you could save the .sxw as .txt, or paste it into a text editor (Notepad on Windows or gedit on Linux, say), save it, and then try opening the .txt in OpenOffice.

Sorry if this isn't overly helpful, but I do know that translators often have problems with end-of-line returns when trying to use “computer-aided translation” tools on text produced by OCR (or rather by “pdf to Word” tools).
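
For plain text where words are still split by a real hyphen followed by an end-of-line return, a rough Python sketch like the one below can rejoin them. This is a heuristic and an assumption about what such text looks like, not something any of the tools mentioned here do; the name `rejoin_wrapped_words` is made up. It only joins when a letter precedes the hyphen and a lowercase letter starts the next line, so a compound like "hand-picked" that happens to break at its hyphen would wrongly become "handpicked".

```python
import re

# A letter, then a hyphen, optional trailing blanks, a line break,
# optional leading blanks, then a lowercase letter on the next line.
_EOL_HYPHEN = re.compile(r"(?<=[^\W\d_])-[ \t]*\r?\n[ \t]*(?=[a-z])")


def rejoin_wrapped_words(text):
    """Remove end-of-line hyphens together with the line break after them."""
    return _EOL_HYPHEN.sub("", text)


if __name__ == "__main__":
    print(rejoin_wrapped_words("this was antici-\npated by everyone"))
```

Hyphens in the middle of a line are left alone, as are number ranges broken across lines, because the pattern requires a letter immediately before the hyphen.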

Searching for a hyphen and replacing it with nothing does seem to work with “normal” hyphens, I found.
JohnV
Administrator

Joined: 07 Mar 2003
Posts: 9183
Location: Lexington, Kentucky, USA

Posted: Sat Dec 18, 2004 8:01 am

I find it really strange that you would receive a doc resulting from OCR that contained OO's custom hyphens, but I guess you did.

If I create a doc with custom hyphens inserted with either Ctrl+- or Shift+Ctrl+-, then I can select either type of custom hyphen and it will appear in the Find box (the latter displayed as a square) when I go to Find & Replace. It can then be replaced with nothing.

Another approach might be to Save As the doc to Text and reopen that file though you could lose other formatting.
Harvester
General User

Joined: 17 Dec 2004
Posts: 7

Posted: Sat Dec 18, 2004 10:16 am

JohnV wrote:
If I create a doc with custom hyphens inserted with either Ctrl+- or Shift+Ctrl+-

Ah, that's how I make them... I didn't know this before, either.

Quote:
then I can select either type of custom hyphen and it will appear in the Find box (the latter displayed as a square) when I go to Find & Replace. It can then be replaced with nothing.

Great, this did the trick!
I couldn't select the hyphen with my mouse, but with Shift+Left/Right arrow it did work indeed. Afterwards, I could search for them with Ctrl+G and replace them with nothing. Thanks a lot!


Quote:
Another approach might be to Save As the doc to Text and reopen that file though you could lose other formatting.


I tried this before, but it didn't work. It seems that even .txt files differentiate between normal hyphens and custom hyphens. (It also didn't work if I disabled the "show custom hyphens" option and then did "save as" -> ".txt", or Ctrl+C -> Ctrl+V into Notepad.)
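
If the custom hyphens survive the export to .txt as described, they can be stripped from the text file afterwards. The following is a minimal Python sketch, again assuming the custom hyphen is U+00AD SOFT HYPHEN and that the file was saved as UTF-8; both are assumptions, and the function names and file paths are hypothetical. For a file saved in another encoding, pass that encoding instead.

```python
SOFT_HYPHEN = "\u00ad"  # assumption: the custom hyphen is stored as this character


def strip_soft_hyphens(text):
    """Return text with every soft hyphen removed; normal hyphens are kept."""
    return text.replace(SOFT_HYPHEN, "")


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Read src_path, drop the soft hyphens, and write the result to dst_path."""
    with open(src_path, encoding=encoding) as src:
        cleaned = strip_soft_hyphens(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(cleaned)


if __name__ == "__main__":
    print(strip_soft_hyphens("antici\u00adpated, but hand-picked"))
```

Writing to a separate output file keeps the exported original intact in case the encoding guess turns out to be wrong.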
All times are GMT - 8 Hours