LinuxQuestions.org
Old 08-27-2013, 01:28 PM   #1
Perseus
Get strings distributed along up to 3 lines


Hello to all in the forum,

I need some help, please.

I don't know if this is a job for awk, sed, perl, etc.

Given the following text, I want to extract 2 patterns and print related patterns on the same line:
Code:
pattern1: bc[0-9]d
pattern2: jk[0-9]lmnopqrs
How do I know if they are related? Pattern1 always occurs, but pattern2 does not. If pattern1 is found and
the next pattern found is pattern2, then they are related and should be printed on the same line. If 2 consecutive
pattern1 matches are found (across 2 or 3 lines, or on the same line), it means that the previous pattern1 has no pattern2.

Input:
Code:
abc1defghi
jk3lmnopqr
stuvwxyzza
bc4defghij
klmnopuqrs
tuvwxxyzab
c8defghijk
4lmnopqrst
uvwxyzwwww
Output desired:
Code:
bc1d jk3lmnopqrs
bc4d 
bc8d jk4lmnopqrs
I don't know if this is possible with awk, because awk
reads line by line and, as you can see, a pattern can begin on one line
and end on the next one, or even begin on one line and end 2 lines below.
The goal is to learn how to do it for this sample file and then extend it to
a big file.

Thanks in advance for any help.

Last edited by Perseus; 08-27-2013 at 01:58 PM.
 
Old 08-27-2013, 03:11 PM   #2
konsolebox
If your text is not divided by newlines you could use grep:
Code:
grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs' file
Output:
Code:
bc1d
jk3lmnopqrs
bc4d
bc8d
jk4lmnopqrs
With that output it should now be easy to select which are valid.
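One way to do that selection is a small awk filter over grep's one-match-per-line output. This is only a sketch: the function name pair_matches is made up for illustration, and it assumes the newlines are stripped first (here with tr -d '\n') so that grep can see patterns that span lines.

```shell
# pair_matches (hypothetical helper): reads one match per line, as printed by
# grep -o, and joins each pattern1 with the pattern2 that directly follows it.
pair_matches() {
    awk '
        /^bc[0-9]d$/ {                       # pattern1 opens a new output line
            if (seen) print out              # flush the previous pattern1 first
            out = $0; seen = 1; pending = 1
            next
        }
        pending { out = out " " $0; pending = 0 }   # first pattern2 after a pattern1
        END { if (seen) print out }          # flush the last pattern1
    '
}

# Demo on the sample input from post #1.
printf '%s\n' abc1defghi jk3lmnopqr stuvwxyzza bc4defghij klmnopuqrs \
              tuvwxxyzab c8defghijk 4lmnopqrst uvwxyzwwww |
    tr -d '\n' |
    grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs' |
    pair_matches
```

On that sample it prints three lines: bc1d jk3lmnopqrs, bc4d, and bc8d jk4lmnopqrs. A pattern2 that appears before any pattern1, or a second pattern2 in a row, is ignored.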
 
Old 08-27-2013, 03:20 PM   #3
Perseus
Hello konsolebox,

Thanks for the answer.

The file doesn't have blank lines, but it has newline characters like any standard file.

I'm trying it in Cygwin, but I only get this result:
Code:
$ grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs' file
bc1d
bc4d
Thanks in advance for any help.
 
Old 08-27-2013, 03:35 PM   #4
konsolebox
You can use this C code to convert your file:

Code:
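#include <unistd.h>

#define BUFFER_SIZE 2000
char buffer[BUFFER_SIZE];

/* Copy stdin to stdout with every newline removed. */
int main (void) {
    ssize_t count;
    /* Stop on end of input (0) and on read errors (-1). */
    while ((count = read(0, buffer, BUFFER_SIZE)) > 0) {
        ssize_t i, j;
        /* j marks the start of the current newline-free run. */
        for (i = 0, j = 0; i < count; ++i) {
            if (buffer[i] == '\n') {
                if (i > j) {
                    write(1, buffer + j, i - j);
                }
                j = i + 1;
            }
        }
        /* Flush whatever follows the last newline in this chunk. */
        if (i > j) {
            write(1, buffer + j, i - j);
        }
    }
    return count < 0 ? 1 : 0;
}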
Compile it (here as output_binary) and run:
Code:
./output_binary < file | grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs'

Last edited by konsolebox; 08-27-2013 at 03:46 PM.
 
Old 08-27-2013, 03:57 PM   #5
danielbmartin
Quote:
Originally Posted by Perseus
The file doesn't have blank lines, but it has newlines characters as any standard file.
You may use the excellent grep provided by konsolebox this way ...
Code:
 paste -s -d"\0" <"$InFile2"                 \
|grep -o -e 'bc[0-9]d' -e 'jk[0-9]lmnopqrs'  \
|paste -s -d" "                              \
|sed 's/\(bc[0-9]d\)/\n\1/g'                 \
>"$OutFile"
... to produce this ...
Code:
bc1d jk3lmnopqrs 
bc4d 
bc8d jk4lmnopqrs
Daniel B. Martin

Last edited by danielbmartin; 08-27-2013 at 07:54 PM. Reason: Cosmetic improvement
 
Old 08-27-2013, 04:19 PM   #6
Perseus
Hello konsolebox and Daniel,

I'll try your code as soon as possible. The original input file is a dump of a binary file made with the xxd command; it is a 4GB file with 256 characters per line.

Do you think I could use the same code with this large file?
Or is there a way to use the regexes for the patterns to read directly from the binary?

Thanks again for the help.
 
Old 08-28-2013, 01:45 AM   #7
Perseus
Hello konsolebox and Daniel,

I have a problem extracting patterns when they are on the same line.

I want to extract the patterns c + number + some characters + k + number + 7 characters (in blue below):
Code:
abc1defghijk3lyyuopqtstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqrstuvwxyzwwww
Instead of those 2 strings, I'm getting the long string below.
Code:
$ echo "abc1defghijk3lmnopqrstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqrstuvwxyzwwww" | grep -o -e 'c[0-9].*k[0-9].\{7\}'
c1defghijk3lmnopqrstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqr
How can I make grep extract those 2 strings separately?

Thanks in advance for your help.
 
Old 08-28-2013, 01:59 AM   #8
pan64
I would try using the string bc as the record separator (instead of newline),
then remove all the newlines,
and finally print the matching records using a regexp like
^[0-9]d.*jk[0-9]lmnopqrs

You can use awk or perl to implement it.
 
Old 08-28-2013, 02:11 AM   #9
Perseus
Hello Pan64,

Could you please help me with how to do it in awk or perl?

The thing is, as explained in the first post, I need 2 patterns. Pattern1 always occurs
and pattern2 does not, but both can span more than one or two lines,
with an input of 128 bytes per line (xxd was used for the dump).

Thanks for any help.
 
Old 08-28-2013, 02:21 AM   #10
konsolebox
@Perseus Have you tried my solution? How did it go? What needed to be changed?
 
Old 08-28-2013, 03:26 AM   #11
pan64
Something like this:
\n? is there because a newline can be found almost anywhere.
Code:
awk 'BEGIN { RS="b\n?c"; }                  # set record separator (a regex RS needs gawk or mawk)
     ! /^\n?[0-9]\n?d/ { next }             # skip records that do not start with pattern1
   { gsub("\n", "");                        # remove \n
     printf "bc%s", substr($0, 1, 2);       # substr() positions start at 1
     if ( match($0, "jk[0-9]lmnopqrs") ) 
         printf " %s", substr($0, RSTART, RLENGTH);
      print ""
   } ' input.txt
 
Old 08-28-2013, 10:43 AM   #12
danielbmartin
Quote:
Originally Posted by Perseus
... I want to extract the patterns c+number+some characters + k+number+7 characters ...
Try this ...
Code:
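awk -F "" 'BEGIN {RS="c"}   # records end at each "c"; with gawk, -F "" makes every character a field
  {k=index($0,"k");
   # field 1 and the field right after the first "k" must both be single digits
   if (k>0 && NF>k+7 && $1~/^[0-9]$/ && $(k+1)~/^[0-9]$/)
     print RS substr($0,1,k+8)}' "$InFile" >"$OutFile"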
Daniel B. Martin
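For single-line input, the restriction that RS="c" imposes (no further c between the starting c and the k) can also be written as a negated character class, which plain grep -o understands. A minimal sketch, assuming the "some characters" part never contains a c; the function name extract_ck is made up for illustration:

```shell
# Sample line from post #7.
line='abc1defghijk3lmnopqrstuvwxyzzabc4defghijklmnopuqrstuvwxxyzabc8defghijk5lmnopqrstuvwxyzwwww'

# extract_ck (hypothetical helper):
# c + digit + any characters except c + k + digit + 7 characters
extract_ck() {
    grep -o 'c[0-9][^c]*k[0-9].\{7\}'
}

printf '%s\n' "$line" | extract_ck
```

On that line it prints c1defghijk3lmnopqr and c8defghijk5lmnopqr. The c4 candidate is dropped because no k followed by a digit appears before the next c.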
 
Old 08-28-2013, 11:58 PM   #13
Perseus
Hello to all,

Many thanks for your help and your time.

Sure, I've tried everyone's code, but when I try to replicate it on a real file with grep or awk,
it seems the regex is not working for pattern-2. I want to extract these patterns:

pattern-1: ff77 + 6 to 18 characters + 532064 + 10 characters + 814 + 13 characters
pattern-2: 059 + 32 to 34 characters + some characters + 940e + 28 characters

For pattern-1 the regex I'm using works, but for pattern-2 it takes more characters than
I want.

Regex used for pattern-1: ff77.{6,18}532064.{10}814.{13} --> it works
Regex for pattern-2: 059.{32,34}.*940e.\{28\} --> it takes characters belonging to more than one pattern-2.

After the end of pattern-2, 9506 always follows.

The regex I have now for pattern-2 takes all the characters in red.
Code:
93114444444c55535f529332939333303693303032353807ffffffffffffffff77000001532064022272619f81422060001fffff0015000a4800015a00074200
013300013600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e0000550000560007
2a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451907ffff
ff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff7700
0002532064014041612f81422060002fffff0015000a4800015a0007420001330001360001370001660001650001770001690001790000930001220000210001
0900010a00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c906000000
00000080093cc90600000000800005910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559f
ffff00940e01020102010001ffffff020102019506000000000000ff77000003532064022280546f81422060003fffff0015000a4800015a0007420001330001
3600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e00005500005600072a00002f
0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451905ffffff008930
010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff770000045320
64022939276f81422060004fffff0015000a4800015a00074200013300013600013700016600016500017700016900017900009300012200002100010900010a
00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080
093cc90600000000800005910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff0094
0e01020102010001ffffff020102019506000000000000ff77000005532064013741169f81422060354fffff0015000a4800015a000260000133000136000137
00017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800012b00002c00002d00002e00005500
005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c9068888800080000582002e0501000001006500
00000200000200180000000300000300170000000400000400010000000a00ffff0065000000ff77000006532064013741255f81422079900fffff0015000a48
00015a00026000013300013600013700017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800
012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c906888880
And the desired output for regex 2 is:
Code:
93114444444c55535f529332939333303693303032353807ffffffffffffffff77000001532064022272619f81422060001fffff0015000a4800015a00074200
013300013600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e0000550000560007
2a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451907ffff
ff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff7700
0002532064014041612f81422060002fffff0015000a4800015a0007420001330001360001370001660001650001770001690001790000930001220000210001
0900010a00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c906000000
00000080093cc90600000000800005910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559f
ffff00940e01020102010001ffffff020102019506000000000000ff77000003532064022280546f81422060003fffff0015000a4800015a0007420001330001
3600013700016600016500017700016900017900009300012200002100010900010a00012600010800012b00002c00002d00002e00005500005600072a00002f
0000300000930000ff3400800932c90600000000a000800935c90600000000000080093cc90600000000800005910f01020000000d8147451905ffffff008930
010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff020102019506000000000000ff770000045320
64022939276f81422060004fffff0015000a4800015a00074200013300013600013700016600016500017700016900017900009300012200002100010900010a
00012600010800012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90600000000a000800935c90600000000000080
093cc90600000000800005910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff0094
0e01020102010001ffffff020102019506000000000000ff77000005532064013741169f81422060354fffff0015000a4800015a000260000133000136000137
00017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800012b00002c00002d00002e00005500
005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c9068888800080000582002e0501000001006500
00000200000200180000000300000300170000000400000400010000000a00ffff0065000000ff77000006532064013741255f81422079900fffff0015000a48
00015a00026000013300013600013700017e00016900006a00007900009300012200002100010900010a00012600010200010400010500010600011000010800
012b00002c00002d00002e00005500005600072a00002f0000300000930000ff3400800932c90688888000a000800935c906000080000000800943c906888880
Thanks in advance for any help.

Last edited by Perseus; 08-29-2013 at 12:03 AM.
 
Old 08-29-2013, 02:04 AM   #14
pan64
Yes, I think this is the greediness of the regexp. You need to set ff77 as the record separator to avoid such problems.
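A rough sketch of that idea, assuming an awk that accepts a multi-character RS (gawk and mawk do; strictly POSIX awk only uses the first character). It uses plain index() and substr() rather than interval expressions, which not every awk supports. The function name extract_pattern2 is invented here, and it only looks at the first 059 in each ff77 record:

```shell
# extract_pattern2 (hypothetical helper): reads the hex dump on stdin, joins the
# lines, splits at "ff77" and prints at most one pattern-2 candidate per record.
extract_pattern2() {
    tr -d '\n' | awk '
        BEGIN { RS = "ff77" }             # multi-character RS: gawk/mawk extension
        {
            s = index($0, "059")          # start of the pattern-2 candidate
            if (s == 0) next
            rest = substr($0, s + 35)     # skip "059" plus the 32 mandatory characters
            e = index(rest, "940e")       # first 940e after that point
            if (e == 0) next
            len = 35 + (e - 1) + 4 + 28   # up to and including 28 characters past 940e
            if (length($0) - s + 1 >= len) print substr($0, s, len)
        }
    '
}

# Usage (dump.txt stands for the xxd output):
#   extract_pattern2 < dump.txt
```

Because awk reads one ff77 record at a time, the joined dump should not have to fit in memory as a single line, which may matter for the 4GB file.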
 
Old 08-29-2013, 02:06 AM   #15
firstfire
Hi.

If you use grep or perl, you can use the non-greedy quantifier `.*?', like this:

Code:
$ tr -d '\n' <infile | grep -Po '059.{32,34}.*?940e.{28}'
05910f01020000000d8147451907ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
05910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
05910f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
05910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
The `-P' option tells grep to use Perl-compatible regular expressions.

Last edited by firstfire; 08-29-2013 at 02:07 AM. Reason: Mention -P.
 