| Author |
Message |
Harvester General User

Joined: 17 Dec 2004 Posts: 7
|
Posted: Fri Dec 17, 2004 1:24 pm Post subject: How to delete hyphens from an OCR'd document? |
|
|
I have an OCR'd text which contains many hyphens and I would like OpenOffice to delete them automatically.
For example, I have some text like "antici-pated" and I would, of course, like to have "anticipated" instead.
I think OOo must be able to do this, because all hyphens are highlighted with a grey box. But I don't know how to do this. |
|
|
 |
dpeach OOo Advocate


Joined: 06 Oct 2003 Posts: 397 Location: Mérida, Yucatán, México
|
Posted: Fri Dec 17, 2004 2:48 pm Post subject: |
|
|
I was going to say to do a find and replace on them until you said it had a gray box around them. This means that OOo has probably made them some kind of "AutoNumbering" or some such.
You don't say whether all of these hyphens were in the original document, or if OOo inserted them. Also, do all the hyphens happen to be at the beginning of the line?
I would first try turning Autonumbering on and off. If that does not do it, go to Tools | AutoFormat/AutoCorrect and turn off the formatting tools, maybe one by one, so you can see which one affects it. _________________ dpeach
OOo 2.0.4 *** Slackware 11
www.mythoughtspot.com <-- My Blog *** My Podcast --> www.missionarytalks.com |
|
|
 |
JohnV Administrator

Joined: 07 Mar 2003 Posts: 8978 Location: Lexinton, Kentucky, USA
|
Posted: Fri Dec 17, 2004 6:11 pm Post subject: |
|
|
Were the hyphens that end up in the middle of a line at the right margin of the doc you OCRed?
When you OCR text into OO and toggle Ctrl+F10 to see formatting characters, do you find a paragraph break at the end of each line?
Do you find a series of spaces before each line? |
|
|
 |
Harvester General User

Joined: 17 Dec 2004 Posts: 7
|
Posted: Sat Dec 18, 2004 3:00 am Post subject: |
|
|
The "-" were in the original document at the end of the line and indicated that a word was "wrapped" to the next line. (Or so I guess, because I didn't OCR the text myself.)
I don't see anything special with Ctrl+F10.
I can turn them off with the following procedure:
options-> text document -> format aids -> user defined hyphens: (yes/no)
(It may be worded differently in the English version of OOo, though... I merely translated the terms from the German version back to English.)
This is fine, but I want to delete (search and replace by nothing?) them completely from the text. If I do what I wrote above, I just don't see them. If I save the document afterwards, the hyphens are saved in the new document, too.
edit:
I tried to search and replace the "-" sign, but OOo is unable to find them.
There seem to be three different "-" signs:
- a very short one which has a grey box around it and OOo can't find them when I search for "-" -> these I want to delete, because they appear in words like "shad-ow", "fea-ture" etc.
- a normal sized one (which is found when I search for "-") -> these don't have grey boxes around them and indicate words like "hand-picked" "tooth-like" etc.
- very long ones which mark asides in the text -- like this one here -- etc. |
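The three look-alike signs described above are most likely three different Unicode characters, which would explain why a search for "-" only finds one of them. The following is a minimal Python sketch for checking that on text exported from the document. It assumes (this is not confirmed in the thread) that the short grey-boxed hyphen is the soft hyphen U+00AD; the function name `count_hyphen_like` and the sample string are made up for illustration.

```python
import unicodedata
from collections import Counter

# Characters that commonly render as some kind of hyphen or dash:
# soft hyphen, hyphen-minus, hyphen, non-breaking hyphen, en dash, em dash.
# Which of these a given document actually contains is an assumption to verify.
HYPHEN_LIKE = "\u00ad\u002d\u2010\u2011\u2013\u2014"


def count_hyphen_like(text: str) -> Counter:
    """Count each hyphen-like character in text, keyed by its Unicode name."""
    return Counter(unicodedata.name(ch) for ch in text if ch in HYPHEN_LIKE)


# Hypothetical sample: two soft hyphens, one ordinary hyphen, two em dashes.
sample = "shad\u00adow, fea\u00adture, hand-picked, an aside \u2014 like this \u2014 etc."
for name, n in count_hyphen_like(sample).items():
    print(f"{name}: {n}")
```

If the output lists SOFT HYPHEN separately from HYPHEN-MINUS, the two really are distinct characters and have to be searched for separately.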
|
|
 |
Robert Tucker Moderator


Joined: 16 Aug 2004 Posts: 3367 Location: Manchester UK
|
Posted: Sat Dec 18, 2004 5:28 am Post subject: |
|
|
Check the help files under 'field shadings' and 'custom hyphens' .
On the English version of OpenOffice there is a “check box” at:
Tools>Options>Text Document>Formatting Aids>Custom hyphens
but that looks rather like what you have already described.
If what you describe as OCR is in fact the extraction of text from a PDF document, you may find it easier, with reference to the hyphens, to use a “pdf to text” tool rather than “pdf to Word or .doc”. (Or maybe you could save the .sxw as .txt or paste it into a text editor (Notepad on Microsoft, gedit, say, on Linux), save it and then try opening the .txt in OpenOffice.)
Sorry if this isn't overly helpful, but I do know that translators often have problems with end-of-line returns when trying to use “computer-aided translation” tools on text produced by OCR (or rather “pdf to Word” tools).
Searching for a hyphen and replacing it with nothing does seem to work with “normal” hyphens, I found. |
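For the related end-of-line problem in plain-text output, where a word is split by an ordinary hyphen followed by a hard line break, a regular expression can rejoin the halves. This is only a sketch under assumptions: the function name `join_wrapped_words` is made up, and the rule "hyphen, newline, lowercase letter" is a heuristic that will also join a genuine compound such as "hand-picked" if it happens to break across lines.

```python
import re

# A word character, a hyphen, a line break (LF or CRLF), optional indentation,
# then a lowercase letter that continues the word.
_LINE_END_HYPHEN = re.compile(r"(\w)-\r?\n[ \t]*([a-z])")


def join_wrapped_words(text: str) -> str:
    """Rejoin words that were hyphenated across a hard line break."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


sample = "This was antici-\npated, and hand-picked stays as it is."
print(join_wrapped_words(sample))
# prints: This was anticipated, and hand-picked stays as it is.
```

Hyphens inside a line are left alone, so only the line-end cases need to be proofread afterwards.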
|
|
 |
JohnV Administrator

Joined: 07 Mar 2003 Posts: 8978 Location: Lexinton, Kentucky, USA
|
Posted: Sat Dec 18, 2004 8:01 am Post subject: |
|
|
I find it really strange that you would receive a doc resulting from OCR that contained OO's custom hyphens, but I guess you did.
If I create a doc with custom hyphens inserted with either Ctrl+- or Shift+Ctrl+-, then I can select either type of custom hyphen and it will appear in the Find box (the latter displayed as a square) when I go to Find & Replace. It can then be replaced with nothing.
Another approach might be to Save As the doc to Text and reopen that file though you could lose other formatting. |
|
|
 |
Harvester General User

Joined: 17 Dec 2004 Posts: 7
|
Posted: Sat Dec 18, 2004 10:16 am Post subject: |
|
|
| JohnV wrote: | | If I create a doc with custom hyphens inserted with either Ctrl+- or Shift+Ctrl+- |
Ah, that's how I make them... I didn't know this before, either.
| Quote: | | then I can select either type of custom hyphen and it will appear in the Find box (the latter displayed as a square) when I go to Find & Replace. It can then be replaced with nothing. |
Great, this did the trick!
I couldn't select the hyphen with my mouse, but with Shift+Left/Right arrow it did indeed work. Afterwards, I could search for them with Ctrl+G and replace them with nothing. Thanks a lot!
| Quote: | | Another approach might be to Save As the doc to Text and reopen that file though you could lose other formatting. |
I tried this before, but it didn't work. It seems that even .txt files differentiate between normal hyphens and custom hyphens. (It also didn't work if I disabled the "show custom hyphens" option and then used "save as" -> ".txt" (or Ctrl+C -> Ctrl+V in Notepad).) |
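If the custom hyphens do survive in the exported .txt file, they can also be removed outside OOo. A minimal Python sketch, assuming the custom hyphen is stored as the soft hyphen character U+00AD (an assumption, not something confirmed in this thread); `strip_soft_hyphens` is a made-up helper name:

```python
# Assumption: the custom hyphen is the Unicode soft hyphen, U+00AD.
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen; ordinary hyphens and dashes are untouched."""
    return text.replace(SOFT_HYPHEN, "")


sample = "shad\u00adow and fea\u00adture, but hand-picked"
print(strip_soft_hyphens(sample))
# prints: shadow and feature, but hand-picked
```

To apply it to a file, read the exported text with the encoding it was saved in, pass the contents through the function, and write the result back out. If nothing changes, the hyphens are probably a different character than assumed here.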
|
|
 |
|