'STRINGS' and 'GREP' INCONSISTENCY MYSTERY

Ystack · 09-18-2012, 06:58 PM

The 'strings' command can look for string matches in binary files. I use it to look through spreadsheets in '.xls' format with some success, but not consistent success.

Narrowing it down I have found that any line with a "--" (double-dash) in it may not be returned, ie not identified as a string.

However, if I edit the spreadsheet, changing the "--" to ":", then those lines are correctly string-matched & returned.

But on re-editing the spreadsheet, changing the ":" back to "--", the lines are still correctly string-matched & returned.

(btw the 'grep' command will also now succeed similarly on that spreadsheet where it failed previously).

No doubt this has something to do with the special nature of "--" to the Bash shell, but I am thinking there must be some meta setting involved here which would produce consistent success, but what? The hundreds of spreadsheets I need to search through frequently all have many, many lines with "--" (double-dash) in them.

ps. The search-term I am trying to match often occurs in a string containing "--".

NevemTeve · 09-19-2012, 02:39 AM

From your words I guess this is what doesn't work:

Code:

FIND="--"; strings sg.xls | grep "$FIND"

or this:

Code:

FIND="-v"; strings sg.xls | grep "$FIND"

Try this one:

Code:

FIND="--"; strings sg.xls | grep -- "$FIND"
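
The difference between a bare pattern and one protected by "--" can be reproduced without a spreadsheet. A minimal sketch, assuming GNU grep; the sample file and its contents are made up for illustration:

```shell
#!/bin/sh
# Throwaway sample file; name and contents are hypothetical.
tmpdir=$(mktemp -d)
printf '%s\n' 'Hacker' 'Main -- Subs' 'Doc' > "$tmpdir/sample.txt"

FIND="--"

# Unprotected: grep takes "--" as its end-of-options marker, so the file
# name is used as the pattern and nothing useful is searched.
plain=$(grep "$FIND" "$tmpdir/sample.txt" </dev/null 2>/dev/null || true)

# An explicit "--" ends option parsing; the next argument is the pattern.
with_marker=$(grep -- "$FIND" "$tmpdir/sample.txt")

# -e marks its argument as a pattern explicitly; same result.
with_e=$(grep -e "$FIND" "$tmpdir/sample.txt")

echo "plain:       [$plain]"
echo "with marker: [$with_marker]"
echo "with -e:     [$with_e]"
rm -rf "$tmpdir"
```

This only covers a search term that itself starts with a dash. It does not change what strings emits.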

Ystack · 09-19-2012, 04:54 PM

Thanks. Unfortunately I don't have a known-to-fail example to hand to try out since the two I had got 'fixed' via file-editing (as aforementioned) while working on the issue.

FIND="--"; strings sg.xls | grep -- "$FIND"

is not likely to succeed on problem cases, however, because

strings sg.xls

returns all string-lines EXCEPT THOSE CONTAINING "--" in the problem cases (unless and until I have manually edited/re-edited sg.xls as described above). So grep will not even get to see the problem lines when there is a problem.

The inconsistency makes it difficult to locate or reproduce problem cases, but when I find another one I'm sure I will be able to confirm this with absolute certainty. It only arises when I do a search that I already know (from the outside, so to speak) should match in a certain file but fails to do so and, of course, I am not ever searching to actually match "--", but rather string stuff alongside it (ie in the same line in which the "--" features).

Sorry, not sure what the markup for <code> is on this forum.

theNbomr · 09-19-2012, 05:15 PM

How are you certain that strings 'returns all string-lines EXCEPT THOSE CONTAINING "--"'? Strings simply returns sequences of printable characters. It doesn't do anything line-oriented. Since you are scanning spreadsheets, it is possible that the internal storage format of the spreadsheet file treats the "--" sequence specially. It is also possible that strings contained in the spreadsheet are not stored as contiguous strings within the file (although it is likely that short strings are stored contiguously).
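
One concrete way text can be in the file yet invisible to a plain strings run: as far as I know, the .xls format can store a string either one byte per character or as 16-bit little-endian text, and by default GNU strings only looks for the former. Whether that is what happens to the "--" cells here is an assumption, not something established in this thread. A minimal sketch with a fabricated sample file (not a real spreadsheet), assuming GNU binutils strings:

```shell
#!/bin/sh
# Fabricated sample: "Hacker" as single-byte text, then "Main -- Subs"
# as 16-bit little-endian text (each character followed by a NUL byte).
sample=$(mktemp)
printf 'Hacker\000\000' > "$sample"
printf 'M\000a\000i\000n\000 \000-\000-\000 \000S\000u\000b\000s\000\000\000' >> "$sample"

# Default scan: runs of at least 4 printable single-byte characters.
narrow=$(strings "$sample")

# -e l asks GNU strings to look for 16-bit little-endian characters.
wide=$(strings -e l "$sample")

echo "default: $narrow"
echo "-e l:    $wide"
rm -f "$sample"
```

If the missing cells turn up under -e l on a real problem file, that would point at the storage encoding rather than at the "--" itself.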

--- rod

Ystack · 09-19-2012, 06:12 PM

Example:

case-X:
sg.xls has these cells + there is a problem with GREP (/STRINGS) commands:
(A1)"Hacker" (B1)"Main -- Subs" (C1)"Doc"
#> grep 'Hacker' --> Binary File sg.xls matched
#> grep 'Subs' --> ..........................[no matches found]..
#> grep 'Doc' --> Binary File sg.xls matched
#> strings sg.xls --> "Hacker", "Doc" ..........[(B1) unfound!]

case-Y:
EDIT sg.xls to these cells + NO problem with GREP (/STRINGS) commands:
(A1)"Hacker" (B1)"Main : Subs" (C1)"Doc"
#> grep 'Hacker' --> Binary File sg.xls matched
#> grep 'Subs' --> Binary File sg.xls matched
#> grep 'Doc' --> Binary File sg.xls matched
#> strings sg.xls --> "Hacker", "Main : Subs", "Doc"

case-Z:
RE-EDIT sg.xls to these cells + NO problem with GREP (/STRINGS) commands:
(A1)"Hacker" (B1)"Main -- Subs" (C1)"Doc"
#> grep 'Hacker' --> Binary File sg.xls matched
#> grep 'Subs' --> Binary File sg.xls matched
#> grep 'Doc' --> Binary File sg.xls matched
#> strings sg.xls --> "Hacker", "Main -- Subs", "Doc"

Yes, I appreciate your points and was perhaps surprised to discover a total consistency apart from the one mentioned. What I am getting at is that because case-Z falls into line, there must be something that could be applied to case-X, apart from a manual edit, to cause it to fall into line also, and make me outrageously happy.

(It may or may not be connected, but grep will choke when coming across a directory named like "--F" when searching a tree.)

ps. have also discovered that length of printable string is irrelevant.

Ystack · 09-19-2012, 06:47 PM

btw my warehouse has hundreds of lookalike boxes, each represented by a discrete xls spreadsheet. My search engine speedily tells me (for many years now) which box an item is in (and, of course, if I even have that item) .. well, 98% of the time. Solving the remaining 2% of cases is my goal.

theNbomr · 09-19-2012, 08:57 PM

It seems that you've proven that the user-visible content of a spreadsheet does not necessarily get stored in that format. Your manipulation technique to coerce a favorable file format may actually work, but I think your sample size is a bit small so far. The tool od can be used to reliably examine the file format.
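
As a rough idea of what that looks like, here is od run over a few fabricated bytes rather than a real spreadsheet. The sample content is an assumption, chosen to mimic a string stored two bytes per character:

```shell
#!/bin/sh
# Fabricated sample: "Subs" with a NUL byte after every letter.
sample=$(mktemp)
printf 'S\000u\000b\000s\000' > "$sample"

# -A d: decimal offsets; -c: printable bytes as characters, others as escapes.
od -A d -c "$sample"

# grep wants the bytes S,u,b,s side by side and finds none here,
# because a NUL byte sits between each pair of letters.
hits=$(grep -c 'Subs' "$sample" || true)
echo "grep -c 'Subs': $hits"

# Dropping the NUL bytes makes the text contiguous again.
stripped=$(tr -d '\000' < "$sample")
echo "after tr: $stripped"
rm -f "$sample"
```

Piping a whole spreadsheet through tr like this before grep is a crude trick and may throw up false matches, but it is a quick way to test the theory on a known problem file.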

Ystack · 09-20-2012, 12:30 AM

Yes, that makes sense, thanks, but looks like a tough nut to crack.

NevemTeve · 09-20-2012, 02:32 AM

Yet another possibility: perhaps your excel automagically converts '--' to 'en-dash' or 'em-dash'
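
To make that checkable: an en-dash is the character U+2013, which in 16-bit little-endian text is the byte pair 13 20 (hex). A minimal sketch that looks for that pair in a fabricated sample; the sample bytes are an assumption, and in a real file the same pair can occur by coincidence, so a hit is only a hint:

```shell
#!/bin/sh
# Fabricated sample: "a", en-dash, "b" as 16-bit little-endian text,
# i.e. the bytes 61 00 13 20 62 00.
sample=$(mktemp)
printf 'a\000\023\040b\000' > "$sample"

# Hex dump, one byte per column, to see the pair directly.
od -A d -t x1 "$sample"

# Count lines containing the two-byte sequence 0x13 0x20.
# -a treats the binary file as text; -F takes the pattern literally.
pair=$(printf '\023\040')
count=$(LC_ALL=C grep -a -c -F "$pair" "$sample" || true)
echo "lines containing 13 20: $count"
rm -f "$sample"
```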

theNbomr · 09-20-2012, 10:01 AM

Perhaps use a more sophisticated tool such as a module for reading spreadsheets in a scripting language such as Perl or Python. A search of CPAN for 'spreadsheet read excel' seems to turn up some likely candidates for Perl. There is a reason that these things exist.
--- rod.

Ystack · 09-21-2012, 05:36 PM

Mmmm. Don't quite get the 'en-dash' or 'em-dash' suggestion? Perhaps not relevant here since I don't use Excel myself -- boxes & spreadsheets compiled elsewhere and sent here. I view/edit with ooCalc, so that might complicate this approach.

Yes, I already use the Perl spreadsheet manipulation modules on one box and could indeed, as suggested, build a new & more complex search-engine when time permits.

grep embedded in a simple perl wrapper was just such a sweet solution. Copy some directories of xls files to a pendrive with one wee perl script and voilà! ... a search engine -- (no extra infrastructure needed on most linux boxes today).

The wrapper filters out those odd cases on which grep chokes (as above), and the need for it has me not entirely convinced yet that it is the spreadsheets I must 'fix' (though I do appreciate that this is the most likely, and perhaps in the end the only, approach that may produce a solution -- though a less sweet one!).

Ystack · 04-29-2013, 02:12 PM

Since the above it has come to my attention that if GREP encounters a folder with leading double-dashes (eg "--DD") it chokes here also. I think this means I should stop looking for problems in the spreadsheet content.

NevemTeve · 04-29-2013, 04:21 PM

Sounds implausible. Give an example, please.

chrism01 · 04-29-2013, 09:45 PM

Searching through binaries with 'strings' is informative, but not necessarily supposed to be definitive, even in its normal usage, eg checking for linked-in libs.
In your case I'd either extend the Perl+SSmodule code, or just export to csv format.
I think that will take care of the special double dash by converting it to something more reasonable.

Ystack · 04-30-2013, 02:21 AM

Couple of workaround leads to pursue there. Thanks Chris.

************************************************************

As for example (fair request)...just a very simple case:

I would use a fairly empty folder (say /mytest) . Put a link to some other folder in there (maybe with string "perg" in some file inside) and rename the link to "--Z". Now:

/mytest#> grep -i "perg" *

grep: unrecognized option '--Z'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.

/mytest#>
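
That error is consistent with the shell expanding * to a list that includes --Z, which grep then reads as an option rather than a file name. A minimal sketch of two common ways around it, using a throwaway directory and a regular file named --Z instead of the link described above (both are assumptions made to keep the example self-contained):

```shell
#!/bin/sh
# Throwaway directory holding a file whose name starts with "--".
start=$PWD
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Perg is in this box\n' > ./--Z
printf 'nothing here\n' > ./other.txt

# Prefixing the glob with ./ means no expanded name can start with a dash.
by_prefix=$(grep -il "perg" ./*)

# Alternatively, "--" tells grep that every later argument is an operand.
by_marker=$(grep -il -- "perg" *)

echo "with ./*: $by_prefix"
echo "with --:  $by_marker"
cd "$start" || exit 1
rm -rf "$dir"
```

Either form should stop the "unrecognized option" error; searching inside a directory or a link to one would additionally need a recursive option.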