OpenOffice.org Forum at OOoForum.orgThe OpenOffice.org Forum
 
 [Home]   [FAQ]   [Search]   [Memberlist]   [Usergroups]   [Register
 [Profile]   [Log in to check your private messages]   [Log in

Matching partial strings in 2 columns of text data?

 
Post new topic   Reply to topic    OOoForum.org Forum Index -> OpenOffice.org Calc
View previous topic :: View next topic  
Author Message
MontyJ
General User
General User


Joined: 20 Jul 2012
Posts: 10

PostPosted: Thu Aug 16, 2012 8:37 pm    Post subject: Matching partial strings in 2 columns of text data? Reply with quote

I have two columns of string (text) data, A and B.

I want to check for the possibility of a 'probable' match of cells in Col B matching any cell in Col A.

Example of contents-
------------------------------------------
Cells A1:A5 example:

A Tale Of A Tub - Jonathan Swift
A Tale Of Two Cities - Charles Dickens
A Thousand Splendid Suns - Khaled Hosseini
A Thousand Suns - Alex Scarrow
A Time To Betray - Reza Kahlili

Cells B1:B2 example:

Thousand Splendid Suns, A`Khaled Hosseini
A thousand suns - Scarrow, Alex
-------------------------------------------

In this example, you can see that the two entries in Col B are 'probable' matches (duplicates) to two of the entries in Col A. The ultimate goal is to add non-duplicate entries in Col B to Col A, so I need to somehow flag or otherwise identify those that are duplicates.

I need some way to eliminate the laborious manual searching I now do to determine whether a given entry in Col B is 'probably' a match for a given cell in Col A. Col A actually has over 25,000 entries, while Col B may have 100 or so when I check a new 'batch' of entries to be added.

I have looked at many 'examples' of using INDEX with MATCH, SEARCH, LOOKUP and FIND examples, and tried some of them for the example above, but so far I have no results even close.

As the examples show, the book title may have different case, the "The" and "A" at the beginning of a title may be put at the end after a comma, and the author's name may be reversed with a comma. There are other format issues, but these are the most common.

Is there a way to setup and search within a string array for a match of more than one word? If it comes to it, I could enter my search terms manually. For example, using the B1 entry above I could manually enter:

C1= "Khaled"
D1 = "Hosseini"
E1 = "Splendid"
F1 = "Suns"

If I could then search within the A1:A5 array for a cell that contained all those keywords in any order, with any case, then I know I have a probable match! But I am lost on how to do that, or even if it can be done. This would still be more time consuming than I would like, but....

I would prefer to simply paste in my Col B list of entries and perhaps adjust the B1:B? array range and go, but perhaps that is asking too much of Calc?

In either case, any pointers on how I can make this work?

Thanks!

Monty
Back to top
View user's profile Send private message
Robert Tucker
Moderator
Moderator


Joined: 16 Aug 2004
Posts: 3367
Location: Manchester UK

PostPosted: Fri Aug 17, 2012 2:06 am    Post subject: Reply with quote

Code:
=AND(ISNUMBER(SEARCH($C$1;A1;1));ISNUMBER(SEARCH($C$2;A1;1));ISNUMBER(SEARCH($C$3;A1;1));ISNUMBER(SEARCH($C$4;A1;1)))


If you paste this in a column and fill down it will return TRUE for a match and FALSE for no match, like this:



You might really want to look at fuzzy matching:

http://en.wikipedia.org/wiki/Fuzzy_string_searching

The Levenshtein distance is used in many software applications called CAT (computer-aided translation) tools. If you could create a text file with all the entries in it you want to check against you could use one of these tools (e.g. OmegaT) to create a TMX which would just have the original text as a translation of itself. You could then use this TMX to find fuzzy matches for items in your new list. However, I don't think it would very useful if the order of the words is jumbled up significantly.
_________________
LibreOffice 3.6.6 on Fedora 18, LibreOffice 4.0.2 on Ubuntu 13.04 (Double Boot)


Last edited by Robert Tucker on Fri Aug 17, 2012 11:57 pm; edited 1 time in total
Back to top
View user's profile Send private message
MontyJ
General User
General User


Joined: 20 Jul 2012
Posts: 10

PostPosted: Fri Aug 17, 2012 8:14 am    Post subject: Reply with quote

Robert,

Thanks for those tips! It will take me a while to digest your code, but hopefully it is something that can be expanded on.

In your example, I would need ALL of the results in E1 thru E4 to be "True", as all of those words are found in A3. But your layout would be a good way to easily see the 'probability' of a match if say 2 or 3 of the keywords matched (were "True").

As for "fuzzy logic", it might be helpful, but since I am searching for specific, whole words, I would think pursuing "string searching" would be a more practical approach using Calc. Yes the words are 'jumbled up', but the spelling of those words is reliable enough to give me a high probability of a good "hit" if we could make all 4 of those keywords in this example = "True"!

Do you see any way to do that with your approach?? I will dig into it later today, as I have an urgent dental appointment in few hours to solve a problem that is keeping me from thinking all that clearly, LoL.

Again, thanks!

Monty
Back to top
View user's profile Send private message
Robert Tucker
Moderator
Moderator


Joined: 16 Aug 2004
Posts: 3367
Location: Manchester UK

PostPosted: Fri Aug 17, 2012 9:28 am    Post subject: Reply with quote

MontyJ wrote:
In your example, I would need ALL of the results in E1 thru E4 to be "True", as all of those words are found in A3. But your layout would be a good way to easily see the 'probability' of a match if say 2 or 3 of the keywords matched (were "True").

I should probably have explained more clearly that in my example all four words "Khaled", "Hosseini", "Splendid", "Suns" had to be present in one of the A column cells for there to be a "TRUE" in column E – I just arranged them vertically instead of in a row C1:E1 as you considered. Hopefully that will be clear now you have (hopefully painlessly!) attended your dental appointment.

If your spelling is sufficiently near 100% accuracy for your needs then I guess this string search method may be enough. Otherwise I suspect some Levenshtein distance calculation will be necessary on strings compared.
_________________
LibreOffice 3.6.6 on Fedora 18, LibreOffice 4.0.2 on Ubuntu 13.04 (Double Boot)
Back to top
View user's profile Send private message
MontyJ
General User
General User


Joined: 20 Jul 2012
Posts: 10

PostPosted: Fri Aug 17, 2012 2:13 pm    Post subject: Reply with quote

Robert,

'Fhanx 'or ah 'elp!, LoL. That is what I sound like if I verbalize "Thanks for the help" with my lip feeling about the size of a basketball!

Now I get what you are doing! I had some weird results but found the problem to be a typo in the last search term; it should have been (SEARCH($C$4;A1;1)), not (SEARCH($C$1;A1;1)).

I then simply did a SUM of Col E and know how many dupes are in the list. Next step will be to VLOOKUP of all the "True" hits and return me the row number(s) so I don't have to scroll down the entire list to find them all! Actually a whole array (pun intended!) of options now present themselves, thanks to your help.

Monty
Back to top
View user's profile Send private message
Robert Tucker
Moderator
Moderator


Joined: 16 Aug 2004
Posts: 3367
Location: Manchester UK

PostPosted: Sat Aug 18, 2012 12:06 am    Post subject: Reply with quote

MontyJ wrote:
it should have been (SEARCH($C$4;A1;1)), not (SEARCH($C$1;A1;1))

Thanks. Edited it.
_________________
LibreOffice 3.6.6 on Fedora 18, LibreOffice 4.0.2 on Ubuntu 13.04 (Double Boot)
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    OOoForum.org Forum Index -> OpenOffice.org Calc All times are GMT - 8 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group