OpenOffice.org Forum at OOoForum.org

Convert ASCII text by eliminating extra paragraph breaks
JohnV
Administrator
Joined: 07 Mar 2003
Posts: 9183
Location: Lexington, Kentucky, USA

Posted: Mon Mar 01, 2004 5:24 pm    Post subject: Convert ASCII text by eliminating extra paragraph breaks

EDITED 2-26-05 to change to version 2.
EDITED 5-1-05 to improve the 'reduce paragraph spacing' subroutine and change name of starting subroutine.
EDITED 5-6-06 to correct a bug in the 'reduce paragraph spacing' subroutine that under certain circumstances would eat your document one character at a time. Thanks to Robbyn for catching this.

This macro removes the excess paragraph breaks from a file created in an ASCII text editor. It also works on text copied & pasted from the Web that has line breaks inserted, like a message in these forums.

It also provides options to indent each paragraph, reduce the spacing between paragraphs, change spaced indents to tabs, remove excess interior spacing, strip all indents and justify the results. One routine in this macro is designed to reformat text that has been scanned and then OCRed directly into OO. If your OCR program creates a left margin by inserting spaces before each line then you might find this handy.
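
To give a feel for the 'remove excess interior spacing' option, here is a minimal sketch that works on a plain string instead of on the document. The name CollapseInteriorSpaces is made up for this example and is not part of the macro, which does the job with Find & Replace and a placeholder character that protects the indent.

```basic
'Illustration only - NOT part of the macro. Works on a plain string.
Function CollapseInteriorSpaces(sLine As String) As String
 Dim nLead As Integer, p As Integer, sRest As String
 nLead = 0 'Count the leading spaces so the indent survives.
 Do While nLead < Len(sLine)
  If Mid(sLine, nLead + 1, 1) <> " " Then Exit Do
  nLead = nLead + 1
 Loop
 sRest = Mid(sLine, nLead + 1) 'Everything after the indent.
 p = InStr(sRest, "  ") 'Look for two spaces in a row.
 Do While p > 0
  sRest = Left(sRest, p) & Mid(sRest, p + 2) 'Drop one space of the pair.
  p = InStr(sRest, "  ")
 Loop
 CollapseInteriorSpaces = Left(sLine, nLead) & sRest
End Function
```

Called with a line that starts with three spaces and has runs of spaces between the words, it should keep the three leading spaces and leave single spaces between the words.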

Trying to reformat an ASCII file is an exercise in guesswork at best and you shouldn't expect perfect results. This macro has to make assumptions about what is truly the end of a paragraph, a title or part of a list.
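
The main guess can be sketched on a plain array of lines. Assume the longest of the first 50 lines marks the original right margin; a line that stops well short of it probably ends a paragraph, a title or a list item, so its break is kept. This is a simplification with made-up names (JoinWrappedLines, aLines); the macro itself works with text cursors and regular expressions and also looks at indents.

```basic
'Illustration only - NOT part of the macro.
'aLines is an array of strings, one per original "line" of the ASCII file.
'nShortDef plays the role of ShortDef in the macro.
Function JoinWrappedLines(aLines, nShortDef As Integer) As String
 Dim i As Integer, nMargin As Integer, nShort As Integer, sOut As String
 nMargin = 0 'Longest of the first 50 lines = estimated right margin.
 For i = LBound(aLines) To UBound(aLines)
  If i - LBound(aLines) >= 50 Then Exit For
  If Len(aLines(i)) > nMargin Then nMargin = Len(aLines(i))
 Next i
 nShort = nMargin - nShortDef 'Lines no longer than this are "short".
 If nShort < 1 Then nShort = 1
 sOut = ""
 For i = LBound(aLines) To UBound(aLines)
  sOut = sOut & aLines(i)
  If i < UBound(aLines) Then
   If Len(aLines(i)) > nShort And Len(aLines(i + 1)) > 0 Then
    sOut = sOut & " "     'Long line followed by text: replace the break with a space.
   Else
    sOut = sOut & Chr(13) 'Short or blank line: keep the break.
   End If
  End If
 Next i
 JoinWrappedLines = sOut
End Function
```

With nShortDef set to 20, a full-width line followed by more text is joined to it, while a blank line or a short title keeps its own break.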

Code:
'Version 2.2   5-6-06  John Vigor
'Converts ASCII text files, or selected text within them, by stripping out excess
'paragraph breaks. Works with items copied & pasted from the Web that contain line
'breaks such as a message in these forums.
'WARNING - Anything stored on the Clipboard will be overwritten. A copy of your
'original file will NOT be saved to the Clipboard if the file is greater than 60K
'characters or if you process selected text of any size, so you are responsible for
'backing it up yourself. On the other hand, your original file on disk will not be
'changed unless you save the macro results over it (this assumes a saved file).
'Sample processing times on a 770 MHz machine in Pages/Seconds format:
'  10/4, 20/7, 40/16, 80/34, 160/81 (1.35 Min.).
'Hint for long documents - TURN OFF AUTO-SPELLCHECKING. This will save time.
'You can control what items the program asks you about and what happens if you choose
'not to be asked about an item by editing the variables below. These variables are
'ignored if you run the macro on selected text so if you customize them and need to
'be asked about items for a particular file then simply select the entire file with
'Ctrl+A before running the macro.
'You should become familiar with how the program works before you attempt to customize
'it. It works differently if the file is less or more than 60K and when processing a
'selection.
'
Sub ASCII_Formatter_StartHere
'VARIABLES YOU CAN CHANGE.
AskShortParagraphs = True 'Show the query about keeping short paragraphs.
 'Default answer if AskShortParagraphs is False.
 KeepShortParagraphs = True 'HIGHLY recommended and the faster of the two methods. I
 'only maintain the other method because it's needed for one option.
'--
AskShortParaLength = True 'Show a chance to adjust the program's estimate of what should
'be considered a short paragraph that will be kept.
ShortDef = 20 'The minimum number of characters short of the right margin that a line
'must end to be a short paragraph. The paragraph break at the end of any short paragraph
'will be maintained.
'--
AskViewOptions = True 'Show the query about end of program Options.
 'Default, if AskViewOptions is False.
 GoToOptions = True 'If True the Option variables below will take control. They will
 'also control if AskViewOptions is True and you choose to go to the Options.
'--
'The following values control what Options are displayed following initial file processing:
Show1stOptionSet = True 'Show the first set of Options, which are Indent All Paragraphs
'and Reduce Paragraph Spacing.
 'Defaults, if Show1stOptionSet is False.
 O1_1 = False 'Indent all paragraphs.
 O1_2 = False 'Reduce paragraph spacing.
  AskMaxParaSpacing = True 'Ask for the maximum number of blank "lines" between paragraphs.
   O1_2_1 = 1 'Default, if AskMaxParaSpacing is False.
Show2ndOptionSet = True 'Options are Remove Excess Interior Spaces, Change Spaced
'Indents to Tabs and Justify All Paragraphs.
 'Defaults, if Show2ndOptionset is False.
 O2_1 = False 'Remove excess interior spaces.
 O2_2 = False 'Change spaced indents to tabs.
 O2_3 = False 'Justify text.
'--
'The following control aspects of full file, selection and/or Option processing and
'are not ignored if you process a selection:
ShowBackUpWarning = True 'Show warning that file not copied to Clipboard if over 60K.
ShowFinished = True 'Show the finished message.
PageBite = 10 'Number of pages processed at a time for selected text or a file over 60K
'characters. Seems pretty good but I haven't seriously tried to optimize it.
ViewBegin = True 'The cursor and your view are taken to the beginning of the document
'when macro ends. "False" will leave you at the end.
StripHyphens = True 'Assumes hyphens located at the right margin are editorial and do
'not separate true hyphenated words like "half-dollar".
CheckForcedLeftMargin = True 'A VERY quick routine if spaces imitating a left margin
'don't exist. Or it will delete such a margin if they do (known to work for my version of
'ScanSoft's OCR software - the 1st line of the file will contain only a series of spaces).
StripIndents = "0" 'This controls what happens during initial file processing. The default
'value is recommended. The default ("0") is to leave all tabbed or spaced paragraph indents,
'"1" will strip spaced indents, "2" will strip tabbed ones and "3" will strip both kinds.
'If you run the macro on selected text and don't choose to go to the Options then you
'will also be asked about this. One use for this is to deal with paragraphs offset, as
'opposed to just 1st line indented, from the left margin. Normal processing will leave
'all "lines" of such paragraphs indented and followed by paragraph breaks. Stripping
'out the indents will convert these to regular paragraphs which can be edited normally
'in Writer. Running the macro on selected text after normal processing is a personal
'favorite of mine because I often scan documents with offsets which I need to edit.
MarkIt = True 'Insert the "MarkWith" character as the last character of the file to
'indicate it has been previously processed. Allows you to run the macro again and go
'directly to the other Options without wading through the full initial file processing
'again. I do not recommend changing as it controls how the program works on a 2nd run.
MarkWith = Chr(160) 'A nonbreaking space in most fonts, which isn't seen on printing.
Override60K = False 'Setting this to True may significantly slow runtime on large files
'or selections but no separate processing document will be used if you don't like this.
'.....................................................................................
'DO NOT EDIT BELOW HERE UNLESS YOU KNOW WHAT YOU ARE DOING.
lTime = Timer : RunTime = 0 : Skip = false : Over60K = false : ProcSel = false
PrevProc = false : IsSelect = false : ASPL = AskShortParaLength : LastSection = False
thisDoc = thisComponent : oDoc = thisDoc 'oDoc may change.
thisVC = thisDoc.CurrentController.getViewCursor : oVC = thisVC 'oVC may change.
thisText = thisDoc.Text 
MarkSel = thisDoc.Text.createTextCursorByRange(thisVC)'Mark any selection.
thisFrame = thisDoc.CurrentController.getFrame()
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
FandR = thisDoc.createReplaceDescriptor() 'Find & Replace initial set up.
FandR.searchRegularExpression = true      'Use regular expressions.
If NOT thisVC.isCollapsed then IsSelect = true
If IsSelect then
  a$ = "Process selected text only?"& Chr(13) & Chr(13) &"Cancel to Quit."
  RunTime = RunTime + (Timer - lTime)
  iAns = MsgBox(a$,3,"Text in this file has been selected.")
  lTime = Timer : If iAns = 2 then End
  If iAns = 6 then ProcSel = true
EndIf
If ProcSel then 'Tests for over 60K.
 If Len(MarkSel.String) > 60000 OR Len(MarkSel.String) = 0 then Over60K = true
 ELSEIf thisDoc.characterCount > 60000 then Over60K = true
EndIF
If Override60K then Over60K = false
If NOT Over60K then LastSection = true
If Over60K AND NOT ProcSel then MarkSel.gotoStart(false) : MarkSel.gotoEnd(true)
thisTC = thisDoc.Text.createTextCursor : thisTC.gotoEnd(false)
thisTC.goLeft(1,true) 'Get the last character. If it is the MarkWith character (a
 'nonbreaking space) then the file was previously processed. Process or go to Options?
IF thisTC.String = MarkWith then 'Was file previously processed?
 PrevProc = True 'Previously processed.
 Skip = SkipToOptions(Show1stOptionSet,Show2ndOptionSet)
EndIf
'++++++++++++++++++++++++++ All basic information gathered, now control program flow.
Select Case Skip
 Case True  'Go directly to the options.
  Select Case ProcSel 'Do we need to deal with selected text?
   Case False : BackUp(thisFrame,dispatcher)
         RunOptions(Show1stOptionSet,Show2ndOptionSet,FandR)
   Case True : NoBackUpWarning(ShowBackUpWarning)
         SetUpSelection(thisFrame,thatFrame,"Copy",dispatcher,MarkSel,IsSelect,Skip)
         RunOptions(Show1stOptionSet,Show2ndOptionSet,FandR)
         FinishSelection(thatFrame,MarkSel,FandR,dispatcher,MarkSel,thisVC)
  End Select
 Case False 'Going to do the regular file processing.
  Select Case ProcSel 'Do we need to deal with selected text?
   Case True : NoBackUpWarning(ShowBackUpWarning)
         AskShortParas(AskShortParagraphs,ShortDef,thisTC,ASPL)
         AskIndents() : If CheckForcedLeftMargin then ForcedLeftMargin(thisTC,FandR)
         SetUpSelection(thisFrame,thatFrame,"Copy",dispatcher,MarkSel,IsSelect,Skip)
         If Over60K then
           ProcessOver60K(FandR,thisVC,thisDoc,dispatcher,MarkSel,PageBite)
          Else RunMainRoutines(FandR)
           FinishSelection(thatFrame,MarkSel,FandR,dispatcher,MarkSel,thisVC)
         EndIf
         oDoc = thisDoc
         AskRunOptions(AskViewOptions,Show1stOptionSet,Show2ndOptionSet,FandR)
   Case False 'Normally this would be the 1st time entire file is processed. 
              'Won't ask about indents because most users won't care.
         IF Over60K then NoBackUpWarning(ShowBackUpWarning)     
         If NOT Over60K then BackUp(thisFrame,dispatcher)
         AskShortParas(AskShortParagraphs,ShortDef,thisTC,ASPL)
         If CheckForcedLeftMargin then ForcedLeftMargin(thisTC,FandR)
         If Over60K then
           SetUpSelection(thisFrame,thatFrame,"Copy",dispatcher,MarkSel,IsSelect,Skip)
           ProcessOver60K(FandR,thisVC,thisDoc,dispatcher,MarkSel,PageBite)
           oDoc = thisDoc
          Else
           RunMainRoutines(FandR)
         EndIf
         AskRunOptions(AskViewOptions,Show1stOptionSet,Show2ndOptionSet,FandR)
  End Select               
End Select
'++++++++++++++++++++++++++++++
thisVC.gotoEnd(False)
If MarkIt AND NOT PrevProc then 'Only mark the file if MarkIt is set.
 thisText.insertString(thisVC,MarkWith,False)
EndIf
EndMessage(ShowFinished,MarkIt)
If ViewBegin then thisVC.gotoStart(false)
End Sub

Private lTime, KeepShortParagraphs, oVC, StripHyphens, LastSection
Private oDoc, RunTime, Over60K, thisFrame, thatFrame, thisVC, PrevProc
Private StripIndents, jstify, dispatcher, ShortPara, ProcSel
Private O1_1, O1_2, O1_2_1 ,O2_1, O2_2, O2_3, AskMaxParaSpacing, GoToOptions

Function SkipToOptions(Show1stOptionSet,Show2ndOptionSet)
a$ = "Do you want to skip directly to the end of program Options?"
b$ = Chr(13) & "Cancel to quit." : RunTime = RunTime + (Timer - lTime)
iAns = MsgBox(a$ & b$,3,"This file has been previously processed.")
lTime = Timer : If iAns = 2 then End
If iAns = 7 then
  SkipToOptions() = false
 Else Show1stOptionSet = true : Show2ndOptionSet = true : SkipToOptions() = true
EndIf
End Function

Sub CopyOrCutIt(whichFrame,doWhich,dispatcher)
 dispatcher.executeDispatch(whichFrame,".uno:" & doWhich,"",0,Array())
End Sub

Sub PasteIt(whichFrame,dispatcher)
 dispatcher.executeDispatch(whichFrame,".uno:Paste","",0,Array())
End Sub

Sub BackUp(thisFrame,dispatcher) 'Back up file to clipboard.
 dispatcher.executeDispatch(thisFrame,".uno:SelectAll","",0,Array())
 CopyOrCutIt(thisFrame,"Copy",dispatcher)
End Sub

Sub NoBackupWarning(ShowBackUpWarning)
 If NOT ShowBackUpWarning then Exit Sub
 a$ = "Because you are processing selected text or the file contains more than 60K "
 a$ = a$ & "characters no backup copy of your file will be stored to the clipboard!"
 a$ = a$ & " Optional > Auto-spellcheck should be off." & Chr(13)
 b$= "DO YOU HAVE THIS FILE BACKED UP?  ('Yes' will continue - 'No' will abort.)"
 c$ = String(15," ") & "NO BACKUP!" : RunTime = RunTime + (Timer - lTime)
 iAns = MsgBox (a$ & b$,308,c$ & c$ & c$) : lTime = Timer
 If iAns = 7 then End
End Sub

Sub SetUpSelection(thisFrame,thatFrame,doWhat,dispatcher,MarkSel,IsSelect,Skip)
 oDoc = StarDesktop.loadComponentFromURL("private:factory/swriter","_blank",0,Array())
 thatFrame = oDoc.CurrentController.Frame 'Get new doc frame.
 oVC = oDoc.CurrentController.getViewCursor 'Set oVC to new doc.
 If NOT Over60K OR Skip then
  thisVC.gotoRange(MarkSel.Start,false) : thisVC.gotoRange(MarkSel.End,true)
  CopyOrCutIt(thisFrame,"Copy", dispatcher) 'Then fill it
  PasteIt(thatFrame,dispatcher)             'with the selection.
  BackUp(thisFrame,dispatcher) 'If not big selection or are doing Options
 EndIf                         'then backup file to clipboard.
End Sub

Sub FinishSelection(thatFrame,MarkSel,FandR,dispatcher,MarkSel,thisVC)
 dispatcher.executeDispatch(thatFrame,".uno:SelectAll","",0,Array())
 CopyOrCutIt(thatFrame,"Cut",dispatcher)
 thisVC.gotoRange(MarkSel.Start,false)
 thisVC.gotoRange(MarkSel.End,true)
 PasteIt(thisFrame,dispatcher)
 oDoc.dispose : If jstify then thisVC.setPropertyValue("ParaAdjust",2)
End Sub

Sub AskShortParas(AskShortParagraphs,ShortDef,thisTC,ASPL)
 thisTC.gotoStart(false)
 While thisTC.isEndOfParagraph() 'Delete empty paragraphs at top of doc.
  thisTC.goRight(1,true) : thisTC.String = ""
 Wend
 If NOT AskShortParagraphs AND NOT KeepShortParagraphs AND NOT ProcSel then Exit Sub
 If NOT AskShortParagraphs AND NOT ProcSel then iAns = 6 : goto Continue
 a$ = "Generally highly recommended. This saves the formatting of lists,"
 b$ = " tables of contents," & Chr(13) & "indexes, etc. that are not indented."
 c$ = " This is also the fastest method."
 d$ = "KEEP SHORT PARAGRAPHS?"
 RunTime = RunTime + (Timer - lTime)
 iAns = Msgbox (a$ & b$ & c$,3,d$) : lTime = Timer
 If iAns = 2 then End
 If iAns = 7 then KeepShortParagraphs = false
 CONTINUE:
 If iAns = 6 then
  KeepShortParaGraphs = true : thisVC.gotoStart(false) : thisTC.gotoStart(false)
  If NOT PrevProc then
    Do
     If NOT thisTC.isEndOfParagraph then
      thisTC.gotoEndOfParagraph(true) : cnt = cnt +1
      If Len(thisTC.String) > ParaLen then ParaLen = Len(thisTC.String)
     EndIf
    Loop While thisTC.gotoNextParagraph(false) and cnt < 50 
   Else Do 'Get max length of 1st 50 paragraphs which is probably the
         If NOT thisVC.isAtEndOfLine then 'original right margin.
          thisVC.gotoEndOfLine(true) : cnt = cnt + 1
          If (Len(thisVC.String) > ParaLen) then ParaLen = Len(thisVC.String)
         EndIF
         Loop While thisVC.goRight(1,false) and cnt < 50
  EndIF
  ShortPara = ParaLen - ShortDef 'An arbitrary guess of paragraph lengths for
  If ASPL OR ProcSel then            'titles, lists, etc.
  a$ = "You can adjust the short paragraph length, currently set at " & ShortPara
  b$ = ". This is " & ShortDef & " characters short of the estimated right margin of "
  c$ = "the original document based on a sampling of up to 50 lines. 'Cancel' will "
  c$ = c$ & "end the program."
  RunTime = RunTime + (Timer - lTime)
  ShortPara = InputBox(a$ & b$ & c$,"ADJUST SHORT PARAGRAPH LENGTH?",ShortPara)
  lTime = Timer : If ShortPara = "" then End
  ShortPara = Cint(ShortPara): If ShortPara < 1 then ShortPara = 1
  EndIf
 EndIf
End Sub

Sub AskIndents()
 a$ = "0 - Remove nothing." & Chr(13) &"1 - Remove spaced indents." & Chr(13)
 a$ = a$ & "2 - Remove tabbed indents.    3 - Both, or click Cancel to quit."
 Query: RunTime = RunTime + (Timer - lTime)
 sAns = InputBox(a$,"Fix offset paragraphs? (0, 1, 2, 3 or Cancel to quit.)","0")
 lTime = Timer : If sAns = "" then End
 If instr("0123",sAns) = 0 then goto Query
 StripIndents = sAns
 If Cint(StripIndents) > 0 AND KeepShortParagraphs then AskOffset()
End Sub

Sub AskOffset()
a$ = "If you want to convert an offset paragraph(s) to a normal one then the program "
a$ = a$ & "will not keep short paragraphs." & Chr(13) & "Is this what you want to do?"
Runtime = Runtime + (Timer - lTime)
iAns = MsgBox(a$,4,"Convert offset paragraphs?") : lTime = Timer
If iAns = 6 then KeepShortParagraphs = false
End Sub

Sub ProcessOver60K(FandR,thisVC,thisDoc,dispatcher,MarkSel,PageBite)
 'View cursor to entire selection range.
 thisVC.gotoRange(MarkSel.Start,false) : thisVC.gotoRange(MarkSel.End,True)
 LastPage = thisVC.getPage 'Get last page of selection.
 oSelTC = thisDoc.Text.CreateTextCursorByRange(thisVC.Start)
 thisVC.collapseToStart : GetSel = thisVC.getPage : PageBite = PageBite - 1
 Do
  GetSel = GetSel + PageBite 'Get PageBite pages at a time
  If GetSel < LastPage then
    thisVC.jumpToPage(GetSel)'Only the view cursor can jump.
    oSelTC.gotoRange(thisVC,true)
    If Not oSelTC.isEndOfParagraph then
     Do 'Make sure the selection ends at a blank paragraph
      oSelTC.gotoNextParagraph(true)
     Loop Until oSelTC.isEndOfParagraph
    EndIF
   Else LastSection = true : thisVC.gotoRange(MarkSel.End,false)
    oSelTC.gotoRange(thisVC,true)
  EndIf
  thisVC.gotoRange(oSelTC,true)'Only the view cursor can capture text to copy it.
  CopyOrCutIt(thisFrame,"Copy",dispatcher)
  PasteIt(thatFrame,dispatcher)
  RunMainRoutines(FandR)
  oVC.gotoStart(false): oVC.gotoEnd(true)'Get entire selection.
  CopyOrCutIt(thatFrame,"Cut",dispatcher) : PasteIt(thisFrame,dispatcher)
  oSelTC.collapseToEnd   
 Loop While NOT LastSection
 oDoc.dispose(true) : LastSection = false
End Sub

Sub RunMainRoutines(FandR)
CleanUp(FandR)
StripParagraphBreaks(FandR,StripIndents)
End Sub

Sub ForcedLeftMargin(thisTC,FandR)'Delete a forced left margin appearing
thisTC.gotoStart(false) : thisTC.gotoEndOfParagraph(true)'in some OCRed text files.
Margin$ = String(Len(thisTC.String)," ")
If Len(Margin$) > 0 then
 FandR.setSearchString("^" & Margin$)
 Find = oDoc.findFirst(FandR)
 Do While Not IsNull(Find)'Can't use "replaceAll" because
  Find.String = ""'multiple Margin$s in one line will be deleted.
  If Find.gotoNextParagraph(false) then
    Find = oDoc.findNext(Find.End,FandR)
   Else Exit Do
  EndIf
 Loop
EndIf
End Sub

Sub CleanUp(FandR)
FandR.setSearchString("\n")  'Just in case line breaks got into the doc change
FandR.setReplaceString("\n") 'them to paragraph breaks. This will also help with
oDoc.replaceAll(FandR)       'some text copied & pasted from the web.
FandR.setSearchString(" *$") 'Delete all spaces before paragraph breaks. A must!
FandR.setReplaceString("")
oDoc.replaceAll(FandR)
End Sub

Sub StripParagraphBreaks(FandR,StripIndents)
oTC = oDoc.Text.createTextCursor
oText = oDoc.getText() : s$ = Chr(32) & Chr(9)
Const KM = "¥"
If KeepShortParagraphs then
  If NOT PrevProc then
    FandR.setSearchString(".{" & ShortPara+1 & "}$") 'Find long paragraphs
   Else
    Do
     oTC.gotoEndOfParagraph(false) : oVC.gotoRange(oTC,false)
     oVC.gotoStartOfLine(true)
     If Len(oVC.String) > ShortPara then oVC.String = oVC.String & KM
    Loop While oTC.gotoNextParagraph(false)
  EndIf
 Else FandR.setSearchString(".*$") 'or find all non-blank ones
EndIf                              'and mark them (at rear).
FandR.setReplaceString("&" & KM)
oDoc.replaceAll(FandR)
FandR.setSearchString("^ *") 'How to handle spaced indents.   
If StripIndents = "0" or StripIndents = "2" then
  FandR.setReplaceString(KM & "&") 'Mark them (at front).
 Else FandR.setReplaceString("") 'Strip them.
EndIf
oDoc.replaceAll(FandR)       
FandR.setSearchString("^\t*") 'How to handle tabbed indents.
If StripIndents = "0" or StripIndents = "1" then
  FandR.setReplaceString(KM & "&") 'Mark them.
 Else FandR.setReplaceString("") 'Strip them.
EndIf
oDoc.replaceAll(FandR)           
If LastSection then 'Make sure correct "ending" paragraph is marked,
 oTC.gotoEnd(false) 'otherwise para breaks may be added or stripped.
 If NOT oTC.isStartOfParagraph then 'If last para has text then
   oTC.goLeft(1,true)               'make sure it's not marked.
   If oTC.String = KM then oTC.String = ""
 EndIf
EndIF
oTC.gotoStart(false)
Do 'Find marked paragraph breaks and replace with a space.
 Do While NOT oTC.isEndOfParagraph
  oTC.gotoEndOfParagraph(true)
  If Right(oTC.String,1) = KM then
   oTC.collapseToEnd : oTC.goRight(1,true)
   oTC.String = " " : oTC.CollapseToEnd
  EndIf
 Loop
Loop While oTC.gotoNextParagraph(false)
FandR.setSearchString(" " & KM & " *") 'Find space followed by marker and any
FandR.setReplaceString("\n&") 'trailing spaces which may now be in mid paragraph.
oDoc.replaceAll(FandR)        'Insert break and keep found stuff. 
FandR.setSearchString("^ *" & KM) 'Find 0 or more spaces followed by marker at
FandR.setReplaceString("")        'beginning of paragraph and delete found stuff.
oDoc.replaceAll(FandR)
FandR.setSearchString(" $") 'Replace paragraph ending spaces with a break.   
FandR.setReplaceString("\n")
oDoc.replaceAll(FandR)
oDoc.replaceAll(FandR)
FandR.setSearchString(KM) 'Delete remaining markers.
FandR.setReplaceString("")
oDoc.replaceAll(FandR)
FandR.setSearchString("- ") 'Hyphens found at the right margin are now
If StripHyphens then 'hyphens followed by a space. What should be done?
  FandR.setReplaceString("") 'Delete all and join the word "parts". Default
 Else FandR.setReplaceString("-") 'Delete the space but keep the hyphen.
EndIf
oDoc.replaceAll(FandR)
End Sub

Sub AskRunOptions(AskViewOptions,Show1stOptionSet,Show2ndOptionSet,FandR)
 If NOT AskViewOptions AND NOT ProcSel then Goto Continue
 a$ = Chr(13) & "Would you like to see the additional options?" & Chr(13)
 Runtime = Runtime + (Timer - lTime)
 iAns = MsgBox (a$,4,"MAIN FILE PROCESSING FINISHED") : lTime = Timer
 If iAns = 6 then
   GoToOptions = true
  Else GotoOptions = false 
 EndIf
 CONTINUE:
 If GoToOptions then
  EndIt = FirstOptions(Show1stOptionSet,Show2ndOptionSet,FandR)
  If EndIt then Exit Sub
  SecondOptions(Show2ndOptionSet,FandR)
 EndIf
End Sub

Sub RunOptions(Show1stOptionSet,Show2ndOptionSet,FandR)
 EndIt = FirstOptions(Show1stOptionSet,Show2ndOptionSet,FandR)
 If EndIt then Exit Sub
 SecondOptions(Show2ndOptionSet,FandR)
End Sub
 
Function FirstOptions(Show1stOptionSet,Show2ndOptionSet,FandR)
If NOT Show1stOptionSet AND NOT O1_1 AND NOT O1_2 AND NOT ProcSel then Exit Function
If NOT Show1stOptionSet AND NOT ProcSel then goto Silent1
a$ = "1. Indent all paragraphs." : b$ =  Chr(13) & "2. Reduce paragraph spacing."
c$ = Chr(13) & "3. Continue to other options.     'Cancel' will end the program."
d$ = "FIRST OPTION SET - Choose by number (1, 2 or 3) or 'Cancel' to end program."
ASK: RunTime = RunTime + (Timer - lTime)
sAns = InputBox (a$ & b$ & c$,d$,"3") : lTime = Timer
Select Case sAns
Case "1" : Indent(FandR) : a$ = a$ & " COMPLETED!"
Case "2" : ReduceParaSpacing() : b$ = b$ & " COMPLETED!"
Case "3" : Exit Function
Case ""  : FirstOptions() = true : Exit Function
Case Else : Goto ASK
End Select
Goto ASK
SILENT1:
If O1_1 then Indent(FandR)
If O1_2 then ReduceParaSpacing()
End Function
   
Sub Indent(FandR)
FandR.setSearchString("^[!-" & Chr(255) & "]") 'Find paragraphs not starting with space
FandR.setReplaceString("\t&")             ' or tab. Insert tab & keep what was found.
oDoc.replaceAll(FandR)
End Sub

Sub ReduceParaSpacing()
If NOT AskMaxParaSpacing AND NOT ProcSel then ParaSpacing = O1_2_1 : goto Continue
a$ = "Enter the maximum number of blank lines you want between paragraphs."
b$ = Chr(13) & "Zero is a valid entry.              'Cancel' will quit this routine."
ASK: RunTime = RunTime + (Timer - lTime)
ParaSpacing = InputBox (a$ & b$,"ALLOW HOW MANY?",1) : lTime = Timer
If ParaSpacing = "" then Exit Sub
If NOT IsNumeric(ParaSpacing) then goto Ask
ParaSpacing = Cint(ParaSpacing)
CONTINUE: oTC = oDoc.getText().createTextCursor()
oTC.goToStart(false) : bOK = true 'bOK will test for the end of the file.
While oTC.isEndOfParagraph = true and bOK = true 'Skip spacing at top.
bOK = oTC.goRight(1,false)'This will move to the next paragraph from a blank paragraph.             
Wend
bOK = true
Do 'Starting at the beginning of a paragraph.
 cnt = 0
 While oTC.isEndOfParagraph() and bOK       'Is it also the end of a paragraph, i.e,
  cnt = cnt + 1 'Count blank paragraphs.    'a blank paragraph?
   If (cnt > ParaSpacing) then
     oTC.goLeft(1,true) : oTC.String = ""  'Select it and delete it.
     bOK = oTC.goRight(1,false) 'Go right to next paragraph.
    Else
     bOK = oTC.goRight(1,false) 'Move to next paragraph.
   EndIf
 Wend
Loop While oTC.goToNextParagraph(false) 'Another end of file check.
End Sub

Sub SecondOptions(Show2ndOptionSet,FandR)
If NOT Show2ndOptionSet AND NOT O2_1 AND NOT O2_2 AND NOT O2_3 AND NOT ProcSel then Exit Sub
If NOT Show2ndOptionSet AND NOT ProcSel then goto Silent2
a$ = "1. Replace spaced indents with tabs."
b$ = Chr(13) & "2. Remove excess interior spaces."
c$ = Chr(13) & "3. Justify text." : d$ = "       4. Run all."
e$ = "SECOND OPTION SET - Choose by number (1, 2, 3 or 4) or 'Cancel' to end program."
ASK: RunTime = RunTime + (Timer - lTime)
sAns = InputBox(a$ & b$ & c$ & d$,e$," ") : lTime = Timer
Select Case sAns
Case  "" : Exit Sub
Case "1" : ReplaceSpacesB4LinesWithTab(FandR) : a$ = a$ & " COMPLETED!"
Case "2" : DeleteExcessInteriorSpaces(FandR) : b$ = b$ & " COMPLETED!"
Case "3" : Justify() : c$ = c$ & " COMPLETED!" : jstify = true
Case "4" : ReplaceSpacesB4LinesWithTab(FandR)  : jstify = true
           DeleteExcessInteriorSpaces(FandR)
           Justify() : RunTime = Runtime + (Timer - lTime)
           MsgBox "All routines finished." : lTime = Timer : Exit Sub
Case Else : Goto ASK
End Select
GoTo ASK
SILENT2:
If O2_2 then ReplaceSpacesB4LinesWithTab(FandR) 'O2_2 = change spaced indents to tabs.
If O2_1 then DeleteExcessInteriorSpaces(FandR)  'O2_1 = remove excess interior spaces.
If O2_3 then Justify()
End Sub

Sub ReplaceSpacesB4LinesWithTab(FandR as Object)
MaxIndent = 10 'Any indent in excess of this will be ignored.
FandR.setSearchString("^ *") 'find any number of spaces at beginning of line
Find = oDoc.findFirst(FandR) 'replace with tab, to replace with nothing use ""
While NOT isNull(Find)
 If Len(Find.String) <= MaxIndent then Find.String = Chr(9)
 Find = oDoc.findNext(Find.End,FandR)
Wend   
End Sub

Sub DeleteExcessInteriorSpaces(FandR as Object)
SM = Chr(165) 'Space marker.
FandR.setSearchString("^ *") 'find any number of spaces at beginning of paragraph
Find = oDoc.findFirst(FandR)
While NOT isNull(Find)
Find.String = String(Len(Find.String),SM) 'replace with placeholders
Find = oDoc.findNext(Find.End,FandR)
Wend
FandR.setSearchString(" *") 'find any number of spaces
FandR.setReplaceString(" ") 'replace with one space
oDoc.ReplaceAll(FandR) 'do it
FandR.setSearchString("^" & SM & "*") 'find any number of placeholders at beginning of line
Find = oDoc.findFirst(FandR) 'turn them back into spaces
While NOT isNull(Find)
Find.String = String(Len(Find.String)," ")
Find = oDoc.findNext(Find.End,FandR)
Wend
End Sub

Sub Justify()
oTC = oDoc.Text.CreateTextCursor()
oTC.gotoEnd(true)
oTC.setPropertyValue("ParaAdjust",2)
End Sub

Sub EndMessage(ShowFinished,MarkIt)
If NOT ShowFinished then Exit Sub
a$ = "Your original document was saved to the Clipboard and can be retrieved from "
b$ = "there if you do not like the macro results. "
RunTime = RunTime + (Timer - lTime)
c$ = "Total processing time was "& RunTime &" Second(s) or "& RunTime/60 &" Minute(s)."
d$ = "If you want to use the normal end of program options run this macro "
d$ = d$ & "again on the file or a selection. "
If MarkIt then
 e$ = "A previously processed marker has been inserted as the last file character. "
EndIf
If NOT Over60K then
  MsgBox (a$ & b$ & e$ & Chr(13) & c$,0,"FINISHED!")
 Else MsgBox (d$ & e$ & Chr(13) & Chr(13) & c$,0,"FINISHED!")
EndIf
End Sub
'Version 2.2  5-6-06


Last edited by JohnV on Mon May 08, 2006 2:48 pm; edited 9 times in total
esperantisto
Super User
Joined: 26 Dec 2003
Posts: 779
Location: Belarus

Posted: Tue Apr 13, 2004 4:47 am

Your macro is great! Well, here are some suggestions to make it greater :)

1. At the start, the macro asks whether paragraphs are indented with spaces (you can specify how many) or tabs. If so, the macro follows a simple algorithm: a line beginning with the specified number of spaces or tabs is considered the first line of a new paragraph.

2. The macro checks whether the lines contain multiple spaces that were inserted to give them all the same length. If so, the multiple spaces are reduced to single ones, and the resulting paragraphs are formatted as justified.
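
A minimal sketch of the test in suggestion 1, assuming "the specified number" means at least that many leading spaces, or any leading tab. StartsNewParagraph is a hypothetical name used only for this illustration; it is not something in JohnV's macro.

```basic
'Illustration only - a hypothetical helper, not part of the macro.
'True if sLine opens with a tab or with at least nSpaces spaces.
Function StartsNewParagraph(sLine As String, nSpaces As Integer) As Boolean
 StartsNewParagraph = False
 If Len(sLine) = 0 Then Exit Function 'A blank line is not an indented line.
 If Left(sLine, 1) = Chr(9) Then
  StartsNewParagraph = True
 ElseIf nSpaces > 0 Then
  'Left returns the whole line if it is shorter than nSpaces,
  'so a too-short line can never match the run of spaces.
  If Left(sLine, nSpaces) = String(nSpaces, " ") Then StartsNewParagraph = True
 End If
End Function
```

Every line for which this returns False would then be joined to the line before it.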

I hope you don't consider me too picky. I can only repeat that your macro is great, but one always wants more :)
JohnV
Administrator
Joined: 07 Mar 2003
Posts: 9183
Location: Lexington, Kentucky, USA

Posted: Sun Apr 25, 2004 4:35 pm

esperantisto,

I haven't been ignoring your post; I have been trying to figure it out. I was stumped by the second part but that finally solved the whole thing for me, I think. Groups of spaces in the middle of lines are a symptom of converting a document that uses a fixed series of spaces before each line to create a left margin. My OCR software does this and it's the reason I wrote the macro in the first place (see my original post). I knew how to easily address this problem for my particular software but had no idea how to check for it in other situations, so I didn't try to in the original macro.
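
The idea of a forced left margin can be sketched like this. It is only an illustration on a plain array of lines with made-up names (CommonLeftMargin, aLines), not the subroutine from the macro: if every non-blank line starts with the same minimum run of spaces, that run is probably a margin and not an indent.

```basic
'Illustration only - NOT the routine used in the macro.
'Returns the number of leading spaces shared by every non-blank line.
Function CommonLeftMargin(aLines) As Integer
 Dim i As Integer, n As Integer, nMin As Integer
 nMin = -1 '-1 means no non-blank line seen yet.
 For i = LBound(aLines) To UBound(aLines)
  If Len(Trim(aLines(i))) > 0 Then 'Ignore lines holding only spaces.
   n = Len(aLines(i)) - Len(LTrim(aLines(i))) 'Leading spaces on this line.
   If nMin = -1 Or n < nMin Then nMin = n
  End If
 Next i
 If nMin < 0 Then nMin = 0
 CommonLeftMargin = nMin
End Function
```

If the result is greater than zero, each line could be shortened with Mid(aLines(i), CommonLeftMargin(aLines) + 1) before the paragraph breaks are removed, so the margin spaces never end up in the middle of a joined paragraph.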

In the version below I have added a new subroutine to sniff out this situation and if this has been your problem you will see much better results. If this still doesn't solve your problem then please e-mail me a copy of a file that gives you trouble: jcvigor AT earthlink.net

I realize I have made a major assumption about your post and it may mean exactly what it says, but I really need to see a sample file or two and run the macro on them to get an idea of any problems you're having. In the meantime please test this new code for me to see if it works OK.
Code:
Sub Main
Dim SeparatedParas as Boolean : SeparatedParas = true
oDoc = thisComponent
FileSaver(false)
FandR=oDoc.createReplaceDescriptor() 'Find & Replace initial set up.
FandR.searchRegularExpression = true 'Use regular expressions with F & R
MarginSpaces(oDoc, FandR)
a$ = "Are the normal paragraphs in this document separated by one or more blank lines?"
iAns = MsgBox (a$,35,"Separated paragraphs?") : If(iAns = 2) then End
If (iAns = 7) then SeparatedParas = false
MainRoutine(oDoc, FandR,SeparatedParas)
a$ = "All excess paragraph breaks have been removed. Would you like to indent all "
b$ = "paragraphs or reduce the paragraph spacing? (You can run the macro again "
c$ = "to do either of these later.)"
iAns = MsgBox(a$ & b$ & c$,36,"OPTIONAL FUNCTIONS") : If(iAns = 7) then FileSaver(true): End
iAns = MsgBox("Would you like to indent all paragraphs?",36,"INDENT?")
If(iAns = 6) then Indent(oDoc,FandR)
a$ = "To reduce paragraph spacing enter the maximum number of blank lines between "
b$ = "paragraphs. Zero is valid here. To quit click Cancel."
sAns = InputBox(a$ & b$,"REDUCE PARAGRAPH SPACING?",1)
If (sAns = "") then
  FileSaver(true) : End
Else ParaSpacing = Cint(sAns)
EndIf
ReduceParaSpacing(oDoc,ParaSpacing)
FileSaver(true)
End Sub

Sub FileSaver(Finish as Boolean) 'Set the application extension for new file:
Static WasPath : Static SavePath :  AppExt = ".sxw" '<=======================
GlobalScope.BasicLibraries.LoadLibrary("Tools")
If (Finish) then goto Final
WasPath = convertFromURL(thiscomponent.URL) : WasCopy = WasPath
If (WasPath = "") then
WorkURL = GetPathSettings("Work") : WorkPath = convertFromURL(WorkURL) + "\"
Waspath = InputBox("Enter a file name or full path","UNSAVED FILE",WorkPath)
If (Instr(WasPath,".") = 0) then WasPath = WasPath + AppExt
If (Instr(convertToURL(WasPath),"/") = 0) then WasPath = WorkPath + WasPath
EndIf
While Mid(convertToURL(WasPath),Len(convertToURL(WasPath))-NAMElon,1) <> "/"
NAMElon = NAMElon + 1
Wend           
Do
If Mid(WasPath,Len(WasPath)-NAMEcnt,1) = "." then
  NAMEext = Right(WasPath,NAMEcnt+1) : NAMEloe = Len(NAMEext) : Exit Do
EndIf
NAMEcnt = NAMEcnt + 1
Loop While NAMEcnt < NAMElon
SavePath = WasPath : If (Right(SavePath,NAMEloe) <> AppExt) then SavePath = SavePath + AppExt
Do
WasPath = Left(WasPath,Len(WasPath)-NAMEloe) + ".B4" + NAMEext
Loop While FileExists(WasPath)
If (FileExists(SavePath) and SavePath <> WasCopy) then
iAns = MsgBox("OK to OVERWRITE: " & Chr(13) & SavePath & " ?",36,"EXISTING FILE")
If (iAns <> 6) then End
EndIf
thisComponent.StoreToURL(convertToURL(WasPath),Array())
Exit Sub
Final:      thisComponent.StoreAsURL(convertToURL(SavePath), Array())
a$ = "Your original file was saved as:" & Chr(13) & WasPath & Chr(13)
b$ = "Your new file was saved as:" & chr(13) & SavePath : MsgBox a$ & b$
oVC = thisComponent.GetCurrentController().GetFrame().GetController().getViewCursor()
oVC.gotoStart(false)'Get control of the view cursor and put it at the beginning of file.
End Sub

Sub MainRoutine(oDoc,FandR,SeparatedParas) 'Remove excess paragraph breaks.
FandR.setSearchString(" *$") 'Delete all spaces before paragraph breaks.
FandR.setReplaceString("")
oDoc.replaceAll(FandR)
FandR.setSearchString("\n")  'Just in case line breaks got into the doc change them
FandR.setReplaceString("\n") 'to paragraph breaks. This enables working with some things
oDoc.replaceAll(FandR)       'copied from the web. Looks strange but this is how it's done.
oText = oDoc.getText()
oVC = oDoc.CurrentController.getViewCursor()'Get the view
oTC = oText.createTextCursor() 'cursor and a text cursor.
oVC.gotoStart(false)
While oTC.isEndOfParagraph 'delete blank paras. at top
 oTC.goRight(1,true) : oTC.String = ""
Wend 
cnt = 0 : ParaLen = 0
While oVC.goDown(1,false) and cnt < 20 'Find maximum paragraph length of 1st 20
oVC.gotoEndOfLine(true)                'lines. The original right margin.
cnt = cnt + 1
If (Len(oVC.String) > ParaLen) then ParaLen = Len(oVC.String)
oVC.gotoStartOfLine(false)
Wend
ShortPara = ParaLen - 12'An arbitrary guess of paragraph lengths for titles, lists, etc. and
oTC.goToStart(false) : oVC.gotoStart(false)      'this will also catch all blank paragraphs.
Do 'Mark short paragraphs with ¥. All paragraphs we want to keep will be marked with ¥.
oTC.gotoEndOfParagraph(true)
If (Len(oTC.String) < ShortPara) then
  oTC.collapseToEnd() : oTC.goLeft(1,true)
  oTC.collapseToEnd()
  oText.insertString(oTC.getStart(),"¥",false)
EndIF
Loop While oTC.gotoNextParagraph(false)
FandR.setSearchString("¥$")'Find the 1st para we marked above with ¥.
FindCursor = oDoc.findFirst(FandR)'FindCursor is a text cursor with found item selected.
While Not IsNull(FindCursor)
  oTC.goToRange(FindCursor,false)   'Move the text cursor to this position.
  If(oTC.gotoPreviousParagraph(false) = true) then'Move text cursor to previous para.
    oTC.gotoEndOfParagraph(false)'Then move it to end of that paragraph.
   If (oTC.goLeft(1,true) = true )then 'Get the last character of this paragraph.
    char = oTC.getString()
    If (char <> "¥") then 'If not marked then should we mark it?
     If(Instr(".?!:",char) > 0) then 'A guess that paragraph break should be kept.
      oText.insertString(oTC.getEnd(),"¥",false) 'Mark it.
     EndIf
    EndIf
   EndIf
  EndIf
  FindCursor = oDoc.findNext(FindCursor.End,FandR)
Wend     
If (SeparatedParas = false) then
FandR.setSearchString("[.?!]$")   'Mark all paragraphs ending with these items.
  FandR.setReplaceString("&¥")     'Some sentences WITHIN a paragraph will naturally end
  FandR.SearchCaseSensitive = true 'at the right margin and this will cause an incorrect
  oDoc.replaceAll(FandR)           'selection when this occurs, i.e., this isn't really
EndIf                              'the end of a paragraph.
FandR.setSearchString("$")   'Replace all paragraph breaks with §.
FandR.setReplaceString("§") 'A para break we want to keep now is marked ¥§.
oDoc.replaceAll(FandR)      'A para break we don't want is only marked §.
FandR.setSearchString("¥§")'Restore breaks to the keepers.
FandR.setReplaceString("\n")
oDoc.replaceAll(FandR)
FandR.setSearchString("-§")'Delete end of line hyphenation.
FandR.setReplaceString("")
oDoc.replaceAll(FandR)   
FandR.setSearchString("§")'Replace the remaining markers with a space.
FandR.setReplaceString(" ")
oDoc.replaceAll(FandR)
oVC.gotoEnd(false)'Clean up a left over ¥.                             
oVC.goLeft(1,true)
oVC.String = "" : oVC.gotoStart(false)
End sub

Sub Indent(oDoc,FandR)
FandR.setSearchString("^[:alnum:]?")'Find paragraphs not starting with space or tab.
FandR.setReplaceString("\t&")'Insert tab & keep what was found.
oDoc.replaceAll(FandR)
End Sub

Sub ReduceParaSpacing(oDoc,ParaSpacing)
oTC = oDoc.getText().createTextCursor()
oTC.goToStart(false) : bOK = true 'bOK will test for the end of the file.
While oTC.isEndOfParagraph = true and bOK = true 'Skip spacing at top.
bOK = oTC.goRight(1,false)'This will move to the next paragraph from a blank paragraph.             
Wend
bOK = true
Do 'Starting at the beginning of a paragraph.
cnt = 0
While oTC.isEndOfParagraph() = true and bOK = true 'Is it also the end of a paragraph, i.e,
  cnt = cnt + 1 'Count blank paragraphs.            'a blank paragraph?
   If (cnt > ParaSpacing) then
     bOK = oTC.goRight(1,true) 'Select it (Cursor move returns 'false' to bOK at end of file).
     oTC.String = "" 'Delete it and in effect move to next paragraph.
    Else
     bOK = oTC.goRight(1,false) 'Move to next paragraph.
   EndIf
Wend
Loop Until oTC.goToNextParagraph(false) = false 'Another end of file check.
End Sub

Sub MarginSpaces(oDoc,FandR) 'determine if the left margin is represented by a fixed series of
'spaces before each line of text. This will occur with some OCR software. Mine is a version
'by ScanSoft which does this and "defines" it in the first line as a paragraph of spaces.
oTC = oDoc.Text.createTextCursor() : oTC.gotoStart(false)
oTC.gotoEndOfParagraph(true) : a$ = oTC.String 'check for "definition" in the 1st line
If Len(a$) > 0 and Len(a$) < 20 then
  b$ = String(Len(a$)," ")'Check for para. break preceded by spaces only
  If a$ = b$ then LenBlank = Len(a$): GoTo RemoveMargin 'yes it's defined
EndIf
'print "going to test 2" 
oText = oDoc.getText() 'Assume it may not be "defined" in the 1st line but still exists
oTC = oDoc.Text.createTextCursor() 'Prepare to run a test on the 1st page only
'xray.xray otc:end
oVC = oDoc.CurrentController.getViewCursor()
oVC.gotoStart(false) : oVC.jumpToEndOfPage()'as stop flag
FandR.setSearchString("^ ")'Look for paras. starting with a space
FindCursor = oDoc.findFirst(FandR) : Cnt = 0
Do While NOT isNull(FindCursor)
 FindCursor.gotoEndOfParagraph(true)
 Cnt = Cnt + 1
 If (oText.compareRegionEnds(FindCursor,oVC) < 0) then Exit Do
 GoSub BlankPara
 If bBlank = true and BlankCnt > 2 then GoTo RemoveMargin 'Found 3 qualifying para. breaks
FindCursor = odoc.findNext(FindCursor.End,FandR)
Loop
'print "Going to test 3"
'If doc doesn't use blank paras. to separate normal paras. we may need yet a further test.
Do                       
Lines = Lines + 1 'Count the lines on the first page.
Loop while oVC.goUp(1,false)
If Cnt/Lines < .30 then Exit Sub '30% should be proof enough but if LenBlank has a value, we
Dim LastCnt(19)                  'are not very confident it's right so compute another way.
oVC.jumpToEndOfPage()
FindCursor = oDoc.findFirst(FandR) : last = 0
Do While Not isNull(FindCursor)
 FindCursor.gotoEndOfParagraph(true) : a$ = FindCursor.String
  Do
   last = last + 1
  Loop While Mid(a$,last,1) = " "
 If last < 20 then LastCnt(last) = LastCnt(last) + 1
 last = 0
 If (oText.compareRegionEnds(FindCursor,oVC) < 0) then Exit Do
 FindCursor = oDoc.findNext(FindCursor.End,FandR)
Loop 
For x = 1 to 19
 If LastCnt(x) > BigCnt then BigCnt = LastCnt(x) : LenBlank = x - 1
Next x
GoTo RemoveMargin
'////////////////////////////////BlankPara\\\\\\\\\\\\\\\\\\\\\\
BlankPara: 'Is this a para. break preceded by spaces only?
oTC.gotoRange(FindCursor,false)
oTC.gotoEndOfParagraph(true) : a$ = oTC.String
If Len(a$) > 0 and Len(a$) < 20 then
  b$ = String(Len(a$)," ")
  If a$ = b$ then
    bBlank = true : BlankCnt = BlankCnt + 1
    If Len(a$) > LenBlank then LenBlank = Len(a$)
  EndIf
EndIf 
Return
'////////////////////////////////RemoveMargin\\\\\\\\\\\\\\\\\\\
RemoveMargin:
a$ = "This document appears to use a fixed series of " & LenBlank & " spaces before "
b$ = "each text line which" & Chr(10) & "are used to create a left margin. "
c$ = "Cancel and toggle Ctrl+F10 if you want to view them." & Chr(13) & Chr(13)
d$ = "Delete them? (Highly recommended.)"
iAns = MsgBox (a$ & b$ & c$ & d$,3,"left margin spacing appears to be in use.")
If iAns = 2 then
 End 'Cancel
 ElseIf iAns = 7 then Exit Sub 'No
 Else iAns = "" 'Yes-clear variable
EndIf
LM$ = String(LenBlank," ") 'Imitate left margin
FandR.setSearchString("^" & LM$) 'Delete all occurrences
FindCursor = oDoc.findFirst(FandR)'Note "replaceAll(FandR)" deletes too much. This routine
While NOT isNull(FindCursor)   'has to be limited to 1 delete per paragraph or it will too.
 FindCursor.String = ""
 FindCursor.gotoEndOfParagraph(false) 'Only 1 per paragraph.
 FindCursor = oDoc.findNext(FindCursor.End,FandR)
Wend 
End Sub
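The paragraph-break heuristic described in the MainRoutine comments can also be sketched in plain Python. This is an illustration under stated assumptions, not the macro itself: the function name `unwrap` and its parameters are invented here, the input is a list of lines rather than a Writer document, and only the numbers 20 and 12 and the punctuation sets are taken from the code above.

```python
def unwrap(lines, separated=True, slack=12, sample=20):
    """Join hard-wrapped lines into paragraphs, keeping only the breaks that look real."""
    lines = [ln.rstrip() for ln in lines]
    # Estimate the original right margin from the longest of the first lines.
    margin = max((len(ln) for ln in lines[:sample]), default=0)
    short = margin - slack
    # A short (or blank) line is assumed to end a paragraph, title or list item.
    keep = [len(ln) < short for ln in lines]
    # The line before a kept one also keeps its break if it ends a sentence.
    for i in range(1, len(lines)):
        if keep[i] and lines[i - 1] and lines[i - 1][-1] in ".?!:":
            keep[i - 1] = True
    # Without blank lines between paragraphs, trust sentence-ending punctuation.
    if not separated:
        for i, ln in enumerate(lines):
            if ln and ln[-1] in ".?!":
                keep[i] = True
    paragraphs, current = [], ""
    for ln, kept in zip(lines, keep):
        current += ln
        if kept:
            paragraphs.append(current)
            current = ""
        elif current.endswith("-"):
            current = current[:-1]      # drop end-of-line hyphenation
        else:
            current += " "
    if current:
        paragraphs.append(current.rstrip())
    return paragraphs

wrapped = [
    "This is the first line of a paragraph that runs to",
    "the right margin and then stops.",
    "",
    "A second paragraph that contains one hyphen-",
    "ated word and ends here.",
]
result = unwrap(wrapped)
```

As in the macro, blank lines survive as empty paragraphs, so reducing the spacing between paragraphs remains a separate step.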
JohnV
Administrator

Joined: 07 Mar 2003
Posts: 9183
Location: Lexinton, Kentucky, USA

Posted: Sat May 01, 2004 10:23 am

esperantisto,

I see that my assumption was wrong (not for the first time) and you wanted to do pretty much what you said. I have received your before and after files. I think this is a bit much to add to the macro, which is long enough as it is, but here is another macro that will do what you want. I have divided it into 3 separate tasks which, if all performed in order, will accomplish what I think you want. You can run each separately or all at once.

You can incorporate the code into the main macro by simply pasting this macro at the end of the existing code and adding, in the Main (1st) subroutine, “ConvertOptions” in the position shown below:

MainRoutine(oDoc, FandR,SeparatedParas)
ConvertOptions
a$ = "All excess paragraph breaks have been removed. Would you like to indent all "

This puts it sort of out of order but works and involves very little editing of the code.

The first routine finds paragraphs indented with spaces and converts the spaces to a tab. If you would prefer to just eliminate the spaces and control the indent with a paragraph style then just change Chr(9) to “”. There is actually a 4th routine that isn't used but you could activate it. This routine replaces all spaces before any line with a tab. It has a MaxIndent value that you can adjust. This is meant to keep it from eliminating things that might be centered with spaces.
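A rough Python sketch of the first routine's idea follows, assuming each paragraph is a plain string. The name `indent_to_tab` and its parameters are hypothetical; only the default of 10 for the maximum indent comes from the macro, and the exact cut-off behaviour of AltParaIndent is not reproduced.

```python
import re

def indent_to_tab(paragraphs, max_indent=10, replacement="\t"):
    """Replace a short run of leading spaces with a tab; leave long runs (centered text) alone."""
    out = []
    for p in paragraphs:
        m = re.match(r" +", p)
        # Skip paragraphs with no indent, an indent at or over the limit,
        # or nothing but spaces.
        if m and len(m.group()) < max_indent and len(m.group()) < len(p):
            out.append(replacement + p[m.end():])
        else:
            out.append(p)
    return out

paras = ["    First para", "No indent", " " * 15 + "Centered title"]
tabbed = indent_to_tab(paras)
bare = indent_to_tab(paras, replacement="")
```

Passing an empty replacement corresponds to swapping Chr(9) for an empty string, which leaves the indent to a paragraph style.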

The 2nd routine eliminates multiple spaces in the interior of the document and the 3rd routine does the justification.
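The second routine's goal, collapsing runs of spaces inside a line without touching the indent, can be sketched in Python like this. The macro protects leading spaces with a placeholder character; this sketch simply splits the indent off instead, and the function name is made up for illustration.

```python
import re

def collapse_interior_spaces(line):
    """Reduce runs of two or more interior spaces to one, keeping leading spaces intact."""
    stripped = line.lstrip(" ")
    lead = line[:len(line) - len(stripped)]
    return lead + re.sub(r" {2,}", " ", stripped)

fixed = collapse_interior_spaces("    Two   words    here")
```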
Quote:
1. At start, the macro asks, if paragraphs have indents with spaces (you can specify the number of them) / tabs. And if so, the macro goes after a simple algorithm: a line with space(s) or tab(s) of the specified number at the beginning is considered the first line of a new paragraph.
Does this comment mean that you find that the macro is missing the beginning of paragraphs?

I know it sometimes thinks something is the end of a paragraph when it isn't, so it actually splits a single paragraph into two separate ones. This is particularly true when the paragraphs are not separated by a blank line, as is the case in your Sample.txt file. Is this what you are addressing?
Code:
Sub ConvertOptions
oDoc = thisComponent
FandR=oDoc.createReplaceDescriptor() 'Find & Replace initial set up.
FandR.searchRegularExpression = True 'Use regular expressions
Repeat:
a$ = "1. Replace spaced indents with tabs." & Chr(13)
b$ = "2. Remove excess spaces." & Chr(13)
c$ = "3. Justify text.                    4. Run all."
sAns = InputBox(a$ & b$ & c$,"Select an option by number or Cancel."," ")
Select Case sAns
 Case  "" : Exit Sub
 Case "1" : AltParaIndent(oDoc, FandR)  'ReplaceSpacesB4LinesWithTab(oDoc, FandR)
 Case "2" : DeleteExcessInteriorSpaces(oDoc, FandR)
 Case "3" : Justify(oDoc)
 Case "4" : AltParaIndent(oDoc, FandR) : DeleteExcessInteriorSpaces(oDoc, FandR)
            Justify(oDoc) : MsgBox "All routines finished." : Exit Sub
 Case Else : MsgBox "Select 1, 2, 3 or 4 and OK , or Cancel to quit." : Goto Repeat
End Select
iAns = MsgBox ("That is finished!",1,"Cancel if you are done.") : If iAns = 2 then End
GoTo Repeat
End Sub

Sub AltParaIndent(oDoc as Object,FandR as Object)
MaxIndent = 10 'Any indent longer than this will be ignored.
oTC = oDoc.Text.CreateTextCursor()
Do
 If NOT oTC.isEndOfParagraph() then 'will skip blank paragraphs
  oTC.goRight(1,true) : cnt = 0 'ok, not a blank one so get the spaces, if any, and
  While Right(oTC.String,1) = " " and cnt < MaxIndent
   cnt = cnt +1 : bSpace = true : oTC.goRight(1,true)
  Wend
 EndIf
 If bSpace and cnt < MaxIndent - 1 then oTC.goLeft(1,true) : oTC.String = Chr(9) 'convert
 bSpace = false               'them to a tab. To delete them instead, change Chr(9) to ""
Loop While oTC.gotoNextParagraph(false)
oVC = oDoc.CurrentController.getViewCursor() : oVC.gotoStart(false)
End Sub

Sub ReplaceSpacesB4LinesWithTab(oDoc as Object, FandR as Object)
FandR.setSearchString("^ *") 'find any number of spaces at beginning of line
FandR.setReplaceString(Chr(9))  'replace with tab, to replace with nothing use ""
oDoc.ReplaceAll(FandR) 'do it 
End Sub

Sub DeleteExcessInteriorSpaces(oDoc as Object, FandR as Object)
 FandR.setSearchString("^ *") 'find any number of spaces at beginning of line
 FindCursor = oDoc.findFirst(FandR)
 While NOT isNull(FindCursor)
  'a$ = String(Len(findCursor.String),"`")
  'print a$
  FindCursor.String = String(Len(FindCursor.String),"╬")
  FindCursor = oDoc.findNext(FindCursor.End,FandR)
 Wend 
 FandR.setSearchString(" *") 'find any number of spaces except those following ╬.
 FandR.setReplaceString(" ")  'replace with one space
 oDoc.ReplaceAll(FandR) 'do it 
 FandR.setSearchString("╬") 'find any number of placeholders at beginning of line
 FandR.setReplaceString(" ")  'turn them back into spaces
 oDoc.ReplaceAll(FandR) 'do it
End Sub

Sub Justify(oDoc as Object)
oTC = oDoc.Text.CreateTextCursor()
oTC.gotoEnd(true)
oTC.setPropertyValue("ParaAdjust",2)
End Sub
esperantisto
Super User

Joined: 26 Dec 2003
Posts: 779
Location: Belarus

Posted: Sun May 02, 2004 11:32 pm

JohnV wrote:

You can incorporate the code into the main macro by simply pasting this macro at the end of the existing code and adding, in the Main (1st) subroutine, "ConvertOptions" in the position shown below:

MainRoutine(oDoc, FandR,SeparatedParas)
ConvertOptions
a$ = "All excess paragraph breaks have been removed. Would you like to indent all "



Should I use the 1st or the 2nd variant? Anyway, with both versions, trying to run the amended macro caused an exception at the line "GlobalScope.BasicLibraries.LoadLibrary("Tools")" in the FileSaver sub. Then OOo just crashed on every new attempt. In fact, this may be an OOo bug: I tried the macro on 680 m26. I'll also try it on 1.1.1 and report.

Quote:
Quote:
1. At start, the macro asks, if paragraphs have indents with spaces (you can specify the number of them) / tabs. And if so, the macro goes after a simple algorithm: a line with space(s) or tab(s) of the specified number at the beginning is considered the first line of a new paragraph.
Does this comment mean that you find that the macro is missing the beginning of paragraphs?

I know it sometimes thinks something is the end of a paragraph when it isn't, so it actually splits a single paragraph into two separate ones. This is particularly true when the paragraphs are not separated by a blank line, as is the case in your Sample.txt file. Is this what you are addressing?


Exactly! I am addressing broken paragraphs. I think a straightforward approach would produce quite good results. I imagined something like this (assuming I read text from one text file and write the result to another, and putting aside some details, including constructors; and sorry for my poor Java, but this is the only language in which I can write at least something):

Code:

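BufferedReader br = new BufferedReader(...);
BufferedWriter bw = new BufferedWriter(...);
String s = "", r = "";
while ((s = br.readLine()) != null) {
  if (s.startsWith("    ")) {               // an indented line starts a new paragraph
    if (r.length() > 0) bw.write(r + "\n"); // flush the previous paragraph, if any
    r = s;
  }
  else {                                    // otherwise append to the current paragraph
    r = (r.length() > 0) ? r + " " + s : s;
  }
}
if (r.length() > 0) bw.write(r + "\n");     // write the last paragraph, which the loop never flushes
br.close();
bw.close();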


Of course, this code is a crude thing (it doesn't address hyphenation, among other things); I just want to show the way.
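The same indent-driven idea can be written as a small, self-contained Python function, shown here only as a sketch. The name `join_by_indent` and the four-space default are assumptions for illustration; unlike the Java outline, it drops the indent from the joined paragraph and skips blank lines.

```python
def join_by_indent(lines, indent="    "):
    """Start a new paragraph at each indented line; append all other lines to the current one."""
    paragraphs = []
    current = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(indent):
            if current is not None:
                paragraphs.append(current)
            current = line.strip()
        elif current is None:
            current = line                # text before the first indented line
        else:
            current = current + " " + line.strip()
    if current is not None:
        paragraphs.append(current)        # do not lose the last paragraph
    return paragraphs

joined = join_by_indent([
    "    First paragraph starts",
    "and continues here.",
    "    Second paragraph",
    "ends.",
])
```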
esperantisto
Super User

Joined: 26 Dec 2003
Posts: 779
Location: Belarus

Posted: Sun May 02, 2004 11:46 pm

I've just tried the macro on 1.1.1. Well, now I can only say: YOU ARE GREAT!
thepletts
General User

Joined: 17 Feb 2004
Posts: 11

Posted: Fri May 21, 2004 7:00 pm    Post subject: nice start for gutenberg ebook "cleaning"

Thanks for this macro. It didn't work perfectly for cleaning up a Project Gutenberg etext, but it got some good rough work done, which made subsequent cleaning a lot easier.

Daniel
Prothall
General User

Joined: 05 Sep 2004
Posts: 15

Posted: Mon Sep 06, 2004 5:32 am

I love this macro, but I've got one little problem...

For some reason, it seems to cut off about half of the text I applied it to. It was quite long, but it's rather strange. I was using the text of the Project Gutenberg "Fifty-One Tales" by Lord Dunsany, if you care to test it.
JohnV
Administrator

Joined: 07 Mar 2003
Posts: 9183
Location: Lexinton, Kentucky, USA

Posted: Mon Sep 06, 2004 6:03 pm

Prothall,

The version of the macro you are using will drop text in excess of 64K characters. Daniel Plett brought this problem to my attention. I have fixed this problem and reposted the new code by editing my original post in this thread. It actually makes reference to Project Gutenberg and has been tested with some of those downloads.

If you find any problems with this new version just let me know.
Prothall
General User

Joined: 05 Sep 2004
Posts: 15

Posted: Tue Sep 07, 2004 12:08 pm

Thank you very much, sir.
Prothall
General User

Joined: 05 Sep 2004
Posts: 15

Posted: Wed Oct 13, 2004 5:47 pm

I appear to have developed a problem. While the macro has worked, it now has a nasty habit of killing OpenOffice after replacing most returns with the odd symbols. My computer does have some issues, but none of them are related to this (I believe). I can't test this on another machine at the moment. Is anyone else having this problem? I am using a rather long document, just for reference.
esperantisto
Super User

Joined: 26 Dec 2003
Posts: 779
Location: Belarus

Posted: Wed Oct 13, 2004 9:50 pm

I've used the macro quite extensively, on 1-3 MB documents as well, and have never experienced any problem (well, except that once it took over 1 hour to clean up a document).
JohnV
Administrator

Joined: 07 Mar 2003
Posts: 9183
Location: Lexinton, Kentucky, USA

Posted: Fri Dec 17, 2004 6:59 pm

The latest version of the macro is in my original post. I have edited it today to add back in the elimination of hyphens occurring at the right margin. Normally these would be "editorial" (for lack of a better word) hyphens used to split a word between two lines, as opposed to those used in a true hyphenated word like "half-dollar".

The default (controlled by the variable StripHyphens in the Main routine) is to eliminate these. You can change this behavior, as well as other items found there, by changing the "true" value to "false" (without the quotes).
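The trade-off behind that switch can be shown with a short Python sketch. It is only an illustration: the `strip_hyphens` parameter is named after the macro's StripHyphens variable, but the function itself is hypothetical and does not reproduce how the macro does the work.

```python
def join_lines(lines, strip_hyphens=True):
    """Join wrapped lines into one paragraph, optionally dropping hyphens at line ends."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # A hyphen at the right margin: either remove it or keep it, but add no space.
            text = (text[:-1] if strip_hyphens else text) + line
        elif text:
            text += " " + line
        else:
            text = line
    return text

wrapped = ["The coin was a half-", "dollar and the para-", "graph ended."]
stripped = join_lines(wrapped)
kept = join_lines(wrapped, strip_hyphens=False)
```

With stripping on, "para-graph" is repaired but "half-dollar" loses its real hyphen; with stripping off, the reverse happens. Neither setting is right for every line, which is why it is left as a choice.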

By oversight, when I rewrote the macro to handle text over 64K, I omitted this routine. If anyone experiences problems with this version please let me know.

In reference to the last two posts above, the macro can take forever on long items like Project Gutenberg eBooks. In the introductory comments at the beginning of the macro you can find some time tests I ran to give you an idea and a hint to determine if the macro is actually running or your computer is "hung". I have not experienced a "hung" computer in my testing but you might want to do something more interesting than watch the screen while running this on a big document.
goa103
OOo Advocate

Joined: 11 May 2003
Posts: 279

Posted: Fri Feb 25, 2005 7:31 am    Post subject: Re: Convert ASCII text by eliminating extra paragraph breaks

JohnV wrote:
Code:
FandR.setSearchString("\n")   'Just in case line breaks got into the doc change
FandR.setReplaceString("\n") 'them to paragraph breaks. This will also help with
oDoc.replaceAll(FandR)       'some text copied & pasted from the web.


It seems a very powerful macro, but I only need a simple macro that searches for paragraph breaks and replaces them with line breaks. I thought searching for $ and replacing it with \n would work, but it doesn't. So I checked your macro and really wonder what the above code does. Why would searching for \n and replacing it with \n do anything special? Is it some kind of guru trick?
_________________
An OOo mascot designer
JohnV
Administrator

Joined: 07 Mar 2003
Posts: 9183
Location: Lexinton, Kentucky, USA

Posted: Fri Feb 25, 2005 7:11 pm

It's no "guru trick". For reasons that are beyond me, when you use find & replace \n represents a line break if used as a regular expression in the Find box but represents a paragraph break if used in the Replace box. So all I am doing with this code is finding any line breaks and replacing them with paragraph breaks.

You can check this out by putting some line breaks in a doc with Shift+Enter and using find & replace with the regular expression box checked. Of course, you need to toggle Ctrl+F10 on so you can tell the difference between the two.