LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 09-18-2012, 06:58 PM   #1
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Rep: Reputation: 0
'STRINGS' and 'GREP' INCONSISTENCY MYSTERY


The 'strings' command can look for string matches in binary files. I use it to look through spreadsheets in '.xls' format with some success, but not consistent success.

Narrowing it down, I have found that any line with a "--" (double-dash) in it may not be returned, i.e. not identified as a string.

However, if I edit the spreadsheet, changing the "--" to ":", then those lines are correctly string-matched & returned.

But on re-editing the spreadsheet, changing the ":" back to "--", the lines are still correctly string-matched & returned.

(btw the 'grep' command will also now succeed similarly on that spreadsheet where it failed previously).

No doubt this has something to do with the special nature of "--" to the Bash shell, but I am thinking there must be some meta setting involved here which would produce consistent success, but what? The hundreds of spreadsheets I need to search through frequently all have many, many lines with "--" (double-dash) in them.

ps. The search-term I am trying to match often occurs in a string containing "--".

Last edited by Ystack; 09-18-2012 at 07:03 PM.
 
Old 09-19-2012, 02:39 AM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 1,811

Rep: Reputation: 502
From your words I guess this is what doesn't work:

Code:
FIND="--"; strings sg.xls | grep "$FIND"
or this:

Code:
FIND="-v"; strings sg.xls | grep "$FIND"
Try this one:

Code:
FIND="--"; strings sg.xls | grep -- "$FIND"
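
A note on why the extra `--` matters: grep, like most utilities, treats a bare `--` as the end-of-options marker, so a pattern that itself starts with a dash has to come after it (or after `-e`). Below is a minimal sketch of that behaviour. The `sample` function and its three lines are made up for the demo and only stand in for real `strings sg.xls` output.

```shell
# Hypothetical sample lines standing in for `strings sg.xls` output.
sample() {
    printf '%s\n' 'Hacker' 'Main -- Subs' 'Doc'
}

FIND="--"

# Without a separator, grep reads "--" as "end of options" and is left
# with no pattern at all, so it only prints a usage error:
#   sample | grep "$FIND"

# Either of these hands the pattern to grep untouched.
with_separator=$(sample | grep -- "$FIND")
with_e_flag=$(sample | grep -e "$FIND")

echo "$with_separator"
echo "$with_e_flag"
```

This only helps when the line actually reaches grep; it does nothing if `strings` never emits the line in the first place.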
 
Old 09-19-2012, 04:54 PM   #3
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
Question

Thanks. Unfortunately I don't have a known-to-fail example to hand to try out since the two I had got 'fixed' via file-editing (as aforementioned) while working on the issue.

FIND="--"; strings sg.xls | grep -- "$FIND"

is not likely to succeed on problem cases, however, because

strings sg.xls

returns all string-lines EXCEPT THOSE CONTAINING "--" in the problem cases (unless and until I have manually edited/re-edited sg.xls as described above). So grep will not even get to see the problem lines when there is a problem.

The inconsistency makes it difficult to locate or reproduce problem cases, but when I find another one I'm sure I will be able to confirm this with absolute certainty. It only arises when I do a search that I already know (from the outside, so to speak) should match in a certain file but fails to do so and, of course, I am not ever searching to actually match "--", but rather string stuff alongside it (ie in the same line in which the "--" features).

Sorry, not sure what the markup for <code> is on this forum.
 
Old 09-19-2012, 05:15 PM   #4
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903
How are you certain that strings 'returns all string-lines EXCEPT THOSE CONTAINING "--"'? strings simply returns sequences of printable characters. It doesn't do anything line-oriented. Since you are scanning spreadsheets, it is possible that the internal storage format of the spreadsheet file treats the "--" sequence specially. It is also possible that strings contained in the spreadsheet are not stored as contiguous strings within the file (although it is likely that short strings are stored contiguously).

--- rod
 
Old 09-19-2012, 06:12 PM   #5
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
Example:

case-X:
sg.xls has these cells + there is a problem with GREP (/STRINGS) commands:
(A1)"Hacker" (B1)"Main -- Subs" (C1)"Doc"
#> grep 'Hacker' --> Binary File sg.xls matched
#> grep 'Subs' --> ..........................[no matches found]..
#> grep 'Doc' --> Binary File sg.xls matched
#> strings sg.xls --> "Hacker", "Doc" ..........[(B1) unfound!]


case-Y:
EDIT sg.xls to these cells + NO problem with GREP (/STRINGS) commands:
(A1)"Hacker" (B1)"Main : Subs" (C1)"Doc"
#> grep 'Hacker' --> Binary File sg.xls matched
#> grep 'Subs' --> Binary File sg.xls matched
#> grep 'Doc' --> Binary File sg.xls matched
#> strings sg.xls --> "Hacker", "Main : Subs", "Doc"


case-Z:
RE-EDIT sg.xls to these cells + NO problem with GREP (/STRINGS) commands:
(A1)"Hacker" (B1)"Main -- Subs" (C1)"Doc"
#> grep 'Hacker' --> Binary File sg.xls matched
#> grep 'Subs' --> Binary File sg.xls matched
#> grep 'Doc' --> Binary File sg.xls matched
#> strings sg.xls --> "Hacker", "Main -- Subs", "Doc"


Yes, I appreciate your points and was perhaps surprised to discover a total consistency apart from the one mentioned. What I am getting at is that because case-Z falls into line, there must be something that could be applied to case-X, apart from a manual edit, to make it fall into line also, and make me outrageously happy.

(It may or may not be connected, but grep will choke when coming across a directory named like "--F" when searching a tree.)

ps. have also discovered that length of printable string is irrelevant.

Last edited by Ystack; 09-19-2012 at 06:42 PM.
 
Old 09-19-2012, 06:47 PM   #6
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
btw my warehouse has hundreds of lookalike boxes, each represented by a discrete xls spreadsheet. My search engine speedily tells me (for many years now) which box an item is in (and, of course, if I even have that item) .. well, 98% of the time. Solving the remaining 2% of cases is my goal.
 
Old 09-19-2012, 08:57 PM   #7
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903
It seems that you've proven that the user-visible content of a spreadsheet does not necessarily get stored in that format. Your manipulation technique to coerce a favorable file format may actually work, but I think your sample size is a bit small so far. The tool od can be used to reliably examine the file format.
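
For anyone following along, here is a rough sketch of what such an inspection could look like. The file is a hand-made stand-in, not a real .xls: it holds the word "Subs" once as plain 8-bit text and once as 16-bit little-endian text (each letter followed by a zero byte), which is one plausible way the same visible cell content could end up stored differently. The temporary file and variable names are assumptions for the demo.

```shell
# Build a tiny stand-in file (NOT a real .xls) holding "Subs" in two encodings.
demo=$(mktemp)
printf 'Subs' > "$demo"                    # 8-bit:  53 75 62 73
printf 'S\000u\000b\000s\000' >> "$demo"   # 16-bit: 53 00 75 00 62 00 73 00

# -A d prints decimal offsets, -t x1 prints one hex byte per column.
od -A d -t x1 "$demo"

# The same bytes as one compact hex string, handy for comparing two files.
hex=$(od -A n -t x1 "$demo" | tr -d ' \n')
echo "$hex"

rm -f "$demo"
```

On a real spreadsheet the same idea would be `od -A d -t x1 sg.xls | less`, or `od -c` to see characters rather than hex, then searching the dump around a string that is known to be in the sheet.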

Last edited by theNbomr; 09-19-2012 at 09:00 PM.
 
Old 09-20-2012, 12:30 AM   #8
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
Yes, that makes sense, thanks, but looks like a tough nut to crack.
 
Old 09-20-2012, 02:32 AM   #9
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 1,811

Rep: Reputation: 502
Yet another possibility: perhaps your Excel automagically converts '--' to an en dash or an em dash.
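
If that is what happened, it could also explain why the whole cell goes missing rather than just the dash. The .xls format can store a string either as 8-bit characters or as 16-bit little-endian characters, and a string containing an en dash needs the 16-bit form, while `strings` by default only looks for runs of 8-bit printable characters. GNU `strings` has an `-e l` option for 16-bit little-endian text. The following is a sketch on a hand-made byte sequence (not a real spreadsheet), assuming the GNU binutils version of `strings` is installed; whether this is really what is going on in the problem files is a guess.

```shell
# Hand-made stand-in for one cell's text (NOT a real .xls): "Main", space,
# en dash (U+2013, bytes 13 20), space, "Subs", as 16-bit little-endian.
demo=$(mktemp)
printf 'M\000a\000i\000n\000 \000\023\040 \000S\000u\000b\000s\000' > "$demo"

# Default scan: runs of 8-bit printable characters. The zero bytes between
# the letters mean no run reaches the default minimum length of 4.
narrow_hits=$(strings -a "$demo" | grep -c -- 'Subs')

# -e l asks GNU strings to read 16-bit little-endian characters instead.
wide_hits=$(strings -a -e l "$demo" | grep -c -- 'Subs')

echo "8-bit scan:  $narrow_hits matching line(s)"
echo "16-bit scan: $wide_hits matching line(s)"

rm -f "$demo"
```

If this is the cause, running both `strings sg.xls` and `strings -e l sg.xls` over each spreadsheet and searching the combined output would be one way to cover both storage forms without editing the files.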

Last edited by NevemTeve; 09-20-2012 at 02:33 AM.
 
Old 09-20-2012, 10:01 AM   #10
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903
Perhaps use a more sophisticated tool such as a module for reading spreadsheets in a scripting language such as Perl or Python. A search of CPAN for 'spreadsheet read excel' seems to turn up some likely candidates for Perl. There is a reason that these things exist.
--- rod.
 
Old 09-21-2012, 05:36 PM   #11
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
Mmmm. Don't quite get the 'en-dash' or 'em-dash' suggestion? Perhaps not relevant here since I don't use Excel myself -- boxes & spreadsheets compiled elsewhere and sent here. I view/edit with ooCalc, so that might complicate this approach.

Yes, I already use the Perl spreadsheet manipulation modules on one box and could indeed, as suggested, build a new & more complex search-engine when time permits.

grep embedded in a simple perl wrapper was just such a sweet solution. Copy some directories of xls files to a pendrive with one wee perl script and voilà! ... a search engine -- (no extra infrastructure needed on most linux boxes today).

The wrapper filters out those odd cases on which grep chokes (as above), and the need for it has me not entirely convinced yet that it is the spreadsheets I must 'fix' (though I do appreciate that this is the most likely, and perhaps in the end the only, approach that may produce a solution -- though a less sweet one!).
 
Old 04-29-2013, 02:12 PM   #12
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
Grep update

Since the above it has come to my attention that if GREP encounters a folder with leading double-dashes (eg "--DD") it chokes here also. I think this means I should stop looking for problems in the spreadsheet content.
 
Old 04-29-2013, 04:21 PM   #13
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 1,811

Rep: Reputation: 502
Sounds implausible. Give an example, please.
 
Old 04-29-2013, 09:45 PM   #14
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,289

Rep: Reputation: 2034
Searching through binaries with 'strings' is informative, but not necessarily supposed to be definitive, even in its normal usage, e.g. checking for linked-in libs.
In your case I'd either extend the Perl+SSmodule code, or just export to csv format.
I think that will take care of the special double dash by converting it to something more reasonable.
 
Old 04-30-2013, 02:21 AM   #15
Ystack
LQ Newbie
 
Registered: Dec 2005
Posts: 20

Original Poster
Rep: Reputation: 0
Couple of workaround leads to pursue there. Thanks Chris.

************************************************************

As for example (fair request)...just a very simple case:

I would use a fairly empty folder (say /mytest) . Put a link to some other folder in there (maybe with string "perg" in some file inside) and rename the link to "--Z". Now:

/mytest#> grep -i "perg" *

grep: unrecognized option '--Z'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.

/mytest#>
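
What the error above shows is the shell expanding `*` to include `--Z`, which grep then tries to parse as an option. This is separate from anything inside the spreadsheets. Below is a minimal sketch of two common ways round it, using a throwaway directory and a regular file named `--Z` (the directory, file names and contents are made up for the demo; the original case used a link to a folder).

```shell
# Throwaway directory with a file whose name looks like an option.
dir=$(mktemp -d)
printf 'perg lives here\n' > "$dir/--Z"
printf 'nothing to see\n'  > "$dir/other.txt"

# `grep -i "perg" *` expands to `grep -i perg --Z other.txt`, and grep
# then rejects --Z as an unrecognized option. Two ways round it:

# 1. End option parsing explicitly before the pattern and file names.
hits_separator=$(cd "$dir" && grep -il -- "perg" *)

# 2. Prefix the glob so no expanded name can start with a dash.
hits_prefixed=$(cd "$dir" && grep -il "perg" ./*)

echo "$hits_separator"
echo "$hits_prefixed"

rm -rf "$dir"
```

When `--Z` is a folder or a link to one, as in the example above, `-r` would also be needed for grep to descend into it, e.g. `grep -ri -- "perg" ./*`.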
 
  


All times are GMT -5. The time now is 02:09 AM.
