The 'strings' command can look for string matches in binary files. I use it to look through spreadsheets in '.xls' format with some success, but not consistent success.
Narrowing it down I have found that any line with a "--" (double-dash) in it may not be returned, ie not identified as a string.
However, If I edit the spreadsheet changing the "--" to ":" then those lines are correctly string-matched & returned.
But on re-editing the spreadsheet and changing the ":" back to "--", the lines are still correctly string-matched & returned.
(btw the 'grep' command will also now succeed similarly on that spreadsheet where it failed previously).
No doubt this has something to do with the special nature of "--" to the Bash shell, but I am thinking there must be some meta setting involved here which would produce consistent success, but what? The hundreds of spreadsheets I need to search through frequently all have many, many lines with "--" (double-dash) in them.
ps. The search-term I am trying to match often occurs in a string containing "--".
Thanks. Unfortunately I don't have a known-to-fail example to hand to try out since the two I had got 'fixed' via file-editing (as aforementioned) while working on the issue.
FIND="--"; strings sg.xls | grep -- "$FIND"
is not likely to succeed on problem cases, however, because the strings step on its own returns all string-lines EXCEPT THOSE CONTAINING "--" in the problem cases (unless and until I have manually edited/re-edited the spreadsheet as described above). So grep will not even get to see the problem lines when there is a problem.
The inconsistency makes it difficult to locate or reproduce problem cases, but when I find another one I'm sure I will be able to confirm this with absolute certainty. It only arises when I do a search that I already know (from the outside, so to speak) should match in a certain file but fails to do so and, of course, I am not ever searching to actually match "--", but rather string stuff alongside it (ie in the same line in which the "--" features).
Sorry, not sure what the markup for <code> is in this forum.
How are you certain that strings 'returns all string-lines EXCEPT THOSE CONTAINING "--"'? Strings simply returns sequences of printable characters. It doesn't do anything line oriented. Since you are scanning spreadsheets, it is possible that the internal storage format of the spreadsheet file treats the "--" sequence specially. It is also possible that strings contained in the spreadsheet are not stored as contiguous strings within the file (although it is likely that short strings are stored contiguously).
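One way to test the storage-format idea is to look for printable runs in more than one encoding. The sketch below is only an illustration, under the assumption (not confirmed anywhere in this thread) that some cells are stored as 16-bit little-endian text, which a default 'strings' run would not report. The function name and the sample bytes are made up for the example. GNU strings has an `-e l` option that does a similar 16-bit scan.

```python
import re


def printable_runs(data: bytes, min_len: int = 4):
    """Return printable runs found as 8-bit ASCII, then as UTF-16LE text."""
    # Runs of at least min_len printable ASCII bytes (what plain 'strings' reports).
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Runs of printable ASCII characters each followed by a zero byte (UTF-16LE).
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return found


# Hypothetical sample: one 8-bit string and one 16-bit string inside binary data.
sample = (
    b"\x01\x02" + b"Box 3 -- hammers" + b"\x00\x03"
    + "Box 7 -- spanners".encode("utf-16-le") + b"\x00\x00"
)
for run in printable_runs(sample):
    print(run)
```

If the second kind of run only shows up with the 16-bit pattern, that would point at the file format rather than at the "--" itself.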
Yes, I appreciate your points and was perhaps surprised to discover a total consistency apart from the one mentioned. What I am getting at is that because case-Z falls into line, there must be something that could be applied to case-X, apart from a manual edit, to cause it to fall into line also, and make me outrageously happy.
(It may or may not be connected, but grep will choke when coming across a directory named like "--F" when searching a tree.)
ps. have also discovered that length of printable string is irrelevant.
btw my warehouse has hundreds of lookalike boxes with each represented by a discrete xls spreadsheet. My search engine speedily tells me (for many years now) which box an item is in (and, of course, if I even have that item) .. well, 98% of the time. Solving the remaining 2% of cases is my goal.
It seems that you've proven that the user-visible content of a spreadsheet does not necessarily get stored in that format. Your manipulation technique to coerce a favorable file format may actually work, but I think your sample size is a bit small so far. The tool od can be used to reliably examine the file format.
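For anyone without od to hand, a rough Python equivalent of that kind of byte-level inspection might look like the sketch below. The function name and layout are my own, and the output only loosely resembles what od prints. It shows each byte in hex next to its printable form, so a "--" stored as single bytes can be told apart from one stored with interleaved zero bytes.

```python
def hexdump(data: bytes, start: int = 0, length: int = 64, width: int = 16) -> str:
    """Format a slice of data as offset, hex bytes and printable characters."""
    end = min(start + length, len(data))
    pad = width * 3 - 1  # column width of a full row of hex bytes
    lines = []
    for off in range(start, end, width):
        chunk = data[off:min(off + width, end)]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as "." in the text column.
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{off:08x}  {hexpart:<{pad}}  {text}")
    return "\n".join(lines)


# The same four characters stored as 8-bit text and as 16-bit little-endian text.
print(hexdump(b"A--B"))
print(hexdump("A--B".encode("utf-16-le")))
```

In practice you would read the .xls file in binary mode, find the offset of a known nearby word, and dump a few rows around it.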
Perhaps use a more sophisticated tool such as a module for reading spreadsheets in a scripting language such as Perl or Python. A search of CPAN for 'spreadsheet read excel' seems to turn up some likely candidates for Perl. There is a reason that these things exist.
Mmmm. Don't quite get the 'en-dash' or 'em-dash' suggestion? Perhaps not relevant here since I don't use Excel myself -- boxes & spreadsheets compiled elsewhere and sent here. I view/edit with ooCalc, so that might complicate this approach.
Yes, I already use the Perl spreadsheet manipulation modules on one box and could indeed, as suggested, build a new & more complex search-engine when time permits.
grep embedded in a simple perl wrapper was just such a sweet solution. Copy some directories of xls files to a pendrive with one wee perl script and voilà! ... a search engine -- (no extra infrastructure needed on most linux boxes today).
The wrapper filters out those odd cases on which grep chokes (as above), and the need for it has me not entirely convinced yet that it is the spreadsheets I must 'fix' (though I do appreciate that this is the most likely, and perhaps in the end the only, approach that may produce a solution -- though a less sweet one!).
Since the above it has come to my attention that if GREP encounters a folder with leading double-dashes (eg "--DD") it chokes here also. I think this means I should stop looking for problems in the spreadsheet content.
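The folder problem is most likely separate from the spreadsheet one. A name that starts with "-" is read by grep's own option parser as an option (the shell passes it through untouched), and the usual cure is a bare "--" before the file names, plus `-e` before the pattern. Below is a minimal sketch of how a wrapper could build the command. It is written in Python rather than Perl, and the function names are hypothetical.

```python
import subprocess


def grep_argv(pattern, paths):
    """Build a grep command line that is safe for names starting with '-'."""
    # -e marks the next word as the pattern even if it begins with "-";
    # "--" ends option parsing, so "--DD" is then treated as a path.
    return ["grep", "-r", "-l", "-e", pattern, "--", *paths]


def search(pattern, paths):
    """Run grep and return the names of the files that contain the pattern."""
    result = subprocess.run(grep_argv(pattern, paths), capture_output=True, text=True)
    return result.stdout.splitlines()


print(grep_argv("spanner", ["--DD", "--F"]))
```

Prefixing each path with "./" (so "--DD" becomes "./--DD") is another common way to get the same effect.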
Searching through binaries with 'strings' is informative, but not necessarily supposed to be definitive, even in its normal usage, eg checking for linked-in libs.
In your case I'd either extend the Perl+SSmodule code, or just export to csv format.
I think that will take care of the special double dash by converting it to something more reasonable.
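If you do go the csv route, the search itself stays small. Here is a rough sketch, assuming the spreadsheets have already been exported to UTF-8 .csv files somewhere under one directory. The function name and the case-insensitive matching are my own choices, not anything from the existing wrapper.

```python
import csv
from pathlib import Path


def find_in_csv_tree(root, term):
    """Return (file name, row number, row) for every row containing term."""
    needle = term.lower()
    hits = []
    for path in sorted(Path(root).rglob("*.csv")):
        # errors="replace" keeps the search going if a file has odd bytes in it.
        with open(path, newline="", encoding="utf-8", errors="replace") as fh:
            for row_no, row in enumerate(csv.reader(fh), start=1):
                if any(needle in cell.lower() for cell in row):
                    hits.append((path.name, row_no, row))
    return hits


if __name__ == "__main__":
    # Example: search the current directory tree for one item.
    for name, row_no, row in find_in_csv_tree(".", "spanner"):
        print(name, row_no, row)
```

Because the file names never pass through a command line here, folders named like "--DD" need no special handling.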