grep both boolean and recursive

chiendarret · 12-08-2023, 08:49 AM

I am interested in using grep to find the occurrence of two words in a directory comprising sub-directories and their regular files. Example

francesco@vaio:~/softw/CHARMM_FF$ grep -E 'OC2D1.*NONBONDED | NONBONDED.*OC2D1' ~/softw/CHARMM_FF

where CHARMM_FF is such a directory. Could '-r' be added somewhere, or what else?
Thanks
chiendarret

Turbocapitalist · 12-08-2023, 08:57 AM

Do you want a logical AND or a logical OR?

You can do an AND with xargs:

Code:

grep -r -l -E 'FirstPattern' ~/softw/CHARMM_FF/ | xargs grep -H -E 'SecondPattern'

You can do an OR with an operator:

Code:

grep -r -l -E 'FirstPattern|SecondPattern' ~/softw/CHARMM_FF/

Not sure about XOR though.
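
For what it's worth, an XOR (files that match exactly one of the two patterns) can be sketched by listing the files that match either pattern and then dropping the ones that match both. This is only a rough sketch: the function name grep_xor is made up for illustration, and it assumes GNU grep (for -r and -Z) and an xargs that understands -0.

Code:

```shell
# grep_xor is a hypothetical helper, not something grep provides.
# Usage: grep_xor PATTERN1 PATTERN2 DIRECTORY
# Prints the files under DIRECTORY that match exactly one of the two EREs.
grep_xor() {
    # List every file matching either pattern, NUL-terminated (-Z) so that
    # unusual filenames survive the pipe, then filter out files matching both.
    grep -rlZ -E -e "$1" -e "$2" "$3" |
    xargs -0 sh -c '
        first=$1; second=$2; shift 2
        for file in "$@"; do
            if ! { grep -q -E -e "$first" "$file" && grep -q -E -e "$second" "$file"; }; then
                printf "%s\n" "$file"
            fi
        done
    ' sh "$1" "$2"
}
```

It would be called as grep_xor 'FirstPattern' 'SecondPattern' ~/softw/CHARMM_FF/ and, like the AND example, works per file rather than per line.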

boughtonp · 12-08-2023, 09:59 AM

The regex pattern "first.*second|second.*first" will do what you ask, if the words are on the same line OR the regex engine has been told that "." should also match newlines.

To do the latter with grep, you can use the -z flag, e.g:

Code:

grep -Erlz 'first.*second|second.*first' directory

Depending on the size of the files and where the words are likely to be, it might be more efficient to use ".*?" instead of ".*" (lazy quantifiers need -P rather than -E, since POSIX extended regex has none), or to set a maximum distance, e.g. with ".{0,1000}", or indeed to use multiple greps. Either way, the example in post #2 should probably be using "-Z" (uppercase) for grep and "-0" (zero) for xargs to handle filenames reliably:

Code:

grep -rlZ 'first' directory | xargs -0 grep -l 'second'

(And if matching individual words - without any regex syntax - one could also use "-Fw" in the above example, so -FwrlZ and -Fwl respectively.)
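
To see what the -z flag changes, here is a small throwaway demo. The temporary file and its contents are invented for illustration, and the behaviour shown assumes GNU grep and a file without NUL bytes.

Code:

```shell
# Throwaway demo; the file name and contents are invented for illustration.
demo=$(mktemp -d)
printf 'first word here\nsecond word here\n' > "$demo/two_lines.txt"
pattern='first.*second|second.*first'

# Line by line: the two words never share a line, so nothing matches.
if grep -Eq "$pattern" "$demo/two_lines.txt"; then
    per_line='match'
else
    per_line='no match'
fi

# With -z the input is split on NUL bytes instead of newlines, so the whole
# file is one record and "." is free to cross the line break.
if grep -Eqz "$pattern" "$demo/two_lines.txt"; then
    whole_file='match'
else
    whole_file='no match'
fi

printf 'per line: %s\nwhole file: %s\n' "$per_line" "$whole_file"
rm -r "$demo"
```

The first test reports no match and the second reports a match, which is the difference between a per-line AND and a per-file AND.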

chiendarret · 12-08-2023, 10:09 AM

Thanks, it saved me much time, but not for all cases.

Quote:

$ grep -r -l -E 'OC2D1' ~/softw/CHARMM_FF/ | xargs grep -H -E 'NONBONDED'

correctly found the file containing NONBONDED data for atom type OC2D1. Great!
....................

Quote:

$ grep -r -l -E 'OBL' ~/softw/CHARMM_FF/ | xargs grep -H -E 'NONBONDED'

did not find NONBONDED data for atom type OBL, while

Quote:

$ grep -r -l -E 'NONBONDED' ~/softw/CHARMM_FF/ | xargs grep -H -E 'OBL'

found NONBONDED section of the file lacking data for any atom type. That is, it seems to me not to have acted as 'AND'

By saying that, I assume that the above commands search in the given order, i.e., first before the pipe and then after the pipe.

But probably I am wrong in some way in using your code. My aim is to find data for atom type OBL within NONBONDED

Keeping in mind that section NONBONDED is always the last one in any file.

Thanks

Turbocapitalist · 12-08-2023, 10:18 AM

Some sample data would be needed then, sanitized if necessary. Please show a few lines which include stuff that won't be found along with several permutations of stuff that should be found.

Turbocapitalist · 12-08-2023, 10:24 AM

Or do you mean per line rather than per file?

Code:

grep -r -H -E 'FirstPattern' ~/softw/CHARMM_FF/ | grep -E 'SecondPattern'

allend · 12-09-2023, 08:23 AM

Given the occurrence of CHARMM and

Quote:

My aim is to find data for atom type OBL within NONBONDED

Keeping in mind that section NONBONDED is always the last one in any file.

then the file format is probably like this.

As grep is line oriented, I would not consider it to be the right tool for the OP to be using.

Turbocapitalist · 12-09-2023, 08:43 AM

Quote:

Originally Posted by allend

then the file format is probably like this.

It looks difficult to duplicate as a structure and thus search. Maybe try an associative array of associative arrays with lists. I'd try Perl but perhaps YottaDB or similar key-value system might be in order?

I would guess there are Perl modules or sample programs out there already and maybe some Python.

computersavvy · 12-09-2023, 02:45 PM

why not try

Code:

grep -ir 'OC2D1' ~/softw/CHARMM_FF | grep -i 'NONBONDED'

That should produce a list of the lines that contain both terms, for the entire directory tree.

pan64 · 12-10-2023, 05:12 AM

Quote:

Originally Posted by allend

Given the occurrence of CHARMM and

then the file format is probably like this.

As grep is line oriented, I would not consider it to be the right tool for the OP to be using.

Yes, I would go with awk, perl or python. But I'm not really familiar with this format, so I don't really know what the right way would be.

MadeInGermany · 12-10-2023, 05:53 AM

Quote:

found NONBONDED section of the file lacking data for any atom type.

There seems to be some structure...
In contrast to grep, awk (or perl or python) can parse a structure.
Please provide an input sample!
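
Until a sample turns up, here is a rough sketch of the structure-aware approach with awk. It rests on guesses about the format that nobody in this thread has confirmed: that the section starts at a line beginning with NONBONDED, that it runs to the end of the file (as stated in post #4), and that the atom type is the first field of a data line. The function name nonbonded_for is made up.

Code:

```shell
# nonbonded_for is a hypothetical helper name.
# Usage: nonbonded_for ATOMTYPE DIRECTORY
# Assumptions (unconfirmed, adjust to the real files):
#   - the section header is a line beginning with NONBONDED
#   - the section runs to the end of the file
#   - the atom type is the first whitespace-separated field of a data line
nonbonded_for() {
    find "$2" -type f -exec awk -v atom="$1" '
        FNR == 1                 { in_section = 0 }         # new file: reset the flag
        /^NONBONDED/             { in_section = 1; next }   # header line: start of section
        in_section && $1 == atom { print FILENAME ": " $0 } # data line for the atom type
    ' {} +
}
```

It would be called as nonbonded_for OBL ~/softw/CHARMM_FF and prints only lines inside the NONBONDED section, each prefixed with its file name. If other sections can follow NONBONDED in the real files, a rule that clears in_section at the next header would be needed.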