[SOLVED] Get strings distributed along up to 3 lines

grail · 08-29-2013, 10:50 AM

Borrowing from pan64 and firstfire:

Code:

$ ruby -ne 'BEGIN{$/="ff77"};$_.gsub!(/\n/,"");next unless /.{6,18}532064.{10}814.{13}/;puts "#{$/}#{$&}#{(/059.{32,34}.*?940e.{28}/?$&:"")}"' infile

ff77000001532064022272619f81422060001fffff05910f01020000000d8147451907ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77000002532064014041612f81422060002fffff05910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77000003532064022280546f81422060003fffff05910f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77000004532064022939276f81422060004fffff05910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77000005532064013741169f81422060354fffff
ff77000006532064013741255f81422079900fffff

Perseus · 08-29-2013, 06:04 PM

Hello to all,

Thanks for the help, really.

Hello firstfire,

It works, but when I include two regexes I get an error saying that the -P option only works with one regex.
I'm not sure whether this could be handled directly in Perl. Or is it possible for Perl to read directly
from the binary file and extract the same patterns given by Regex1 and Regex2?

Hello pan64,

It works with your awk code, but I don't know why it stops working when I put in the regexes for the real file;
it does not print anything.

I'm trying with:

Code:

awk 'BEGIN { RS="f\n?f\n?7\n?7"; }
     ! /^.\{6,18\}532064.\{10\}814.\{13\}/ { next }
   { gsub("\n", "");
     printf "ff77" substr($0, 0, 2);          
     if ( match($0, /(059.\{32,34\}.*940e.\{28\})/) ) 
         printf " " substr($0, RSTART, RLENGTH);
      print ""
   } ' 64bytes.txt

Hello grail,

It works just fine. But 2 questions:

1- Since the end of line "\n" could appear anywhere, I've tried to modify the record separator you used, so
instead of $/="ff77" I've tried $/="f\n?f\n?7\n?7", but it seems that is not the way. What would be the correct way?

2- I want to print the patterns found in groups, so I've added () to try to use back references,
but I don't know where to insert the \1, \2, \3, etc. (within the "puts" part).

Thanks in advance for any help.

grail · 08-30-2013, 03:51 AM

Unfortunately, Ruby does not currently support a regex RS.

But I had a bit of a play, how about something like:

Code:

ruby -ne 'BEGIN{$/="ff77"};$_.gsub!(/\n/,"");$_.split($/).each{|x| next unless x =~ /^.{6,18}532064.{10}814.{13}/;puts "#{$/}#{$&}#{(x =~ /059.{32,34}.*?940e.{28}/?$&:"")}"}' file

So basically, after using ff77 as the RS, we remove all the '\n' and then check whether ff77 appears again inside the record; if so, we split on it and run the same checks on each piece.

As for part 2, simply prefix the reference position with $, i.e. $1 instead of \1.
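
A small illustration of the $1 form, using the first record from the sample data with the leading ff77 already stripped. The variable names rec and fields are only for illustration:

```ruby
# One record from the sample data, without the leading "ff77".
rec = "000001532064022272619f81422060001fffff"

# Three capture groups, same shape as the first regex in the one-liner.
if rec =~ /^(.{6,18})(532064.{10})(814.{13})/
  # $1, $2 and $3 hold the groups of the most recent successful match.
  fields = [$1, $2, $3]
  puts fields.join(" ")   # 000001 532064022272619f 81422060001fffff
end
```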

pan64 · 08-30-2013, 04:52 AM

Using tr -d '\n' <infile | <command> will eliminate the newlines, so the regexps will be easier to construct:

Code:

tr -d '\n' <64bytes.txt | awk 'BEGIN { RS="ff77"; }
   {
        if ( ! match($0, "^.{6,18}532064.{10}814.{13}") ) next;
        printf "ff77" substr($0, 1, RLENGTH);
        if ( match($0, "(059.{32,34}.*940e.{28})") )
           printf " " substr($0, RSTART, RLENGTH);
        print ""
   } '

This code worked for me, but only with awk 4.0.

grail · 08-30-2013, 10:51 AM

@pan64 - I did like the tr and paste methods, I am curious though how applying this to a 4GB file would affect performance?

Perseus · 08-30-2013, 10:57 AM

Hello grail and pan64,

Both scripts work just fine, but for the part explained below I was only able to modify the ruby script.

My final goal is to print some characters separately, so for this purpose I've inserted back references following
grail's suggestion. It works and prints correctly, but I don't know how to do the same back referencing for
regex2 in order to print some characters of pattern2 separately too.

Once I'm able to print the substrings, the last pending issue for me is to manipulate the substrings given by the
back references so that some of them are printed in decimal format and others have the f's removed:

Code:

 ruby -ne 'BEGIN{$/="ff77"};$_.gsub!(/\n/,"");
> $_.split($/).each{|x| next unless
> x =~ /(^.{6,18})(532064.{10})(814.{13})/;
> puts "#{$/} #{$1} #{$2} #{$3} #{(x =~ /059.{32,34}.*?940e.{28}/?$&:"")}"
> }' 64bytes.txt
Current output:
ff77 000001 532064022272619f 81422060001fffff 05910f01020000000d8147451907ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 000002 532064014041612f 81422060002fffff 05910f01020000000d8147451925ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 000003 532064022280546f 81422060003fffff 05910f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 000004 532064022939276f 81422060004fffff 05910f01020000000d8147451944ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 000005 532064013741169f 81422060354fffff
ff77 000006 532064013741255f 81422079900fffff

Desired output:
ff77 1 532064022272619 81422060001 05 91 0f 01020000000d8147451907ffffff008 930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 2 532064014041612 81422060002 05 91 0f 01020000000d8147451925ffffff008 930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 3 532064022280546 81422060003 05 91 0f 01020000000d8147451905ffffff008 930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 4 532064022939276 81422060004 05 91 0f 01020000000d8147451944ffffff008 930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00940e01020102010001ffffff02010201
ff77 5 532064013741169 81422060354
ff77 6 532064013741255 81422079900

Thanks a lot for the help and time to all.

grail · 08-30-2013, 12:41 PM

The backreferencing is the same in either portion. You have two options:

1. Alter your regexes to give the level of granularity you are looking for

2. Manipulate the individual returned pieces (i.e. substrings, which in ruby are taken by indexing into the string with the portion required, e.g. a[3..5] will return the 4th to the 6th character, as counting starts at 0)
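
A short sketch of option 2, using one of the fields from the sample output. The variable names are made up for the example:

```ruby
s = "532064022272619f"

mid  = s[3..5]   # range form: indexes 3 through 5, i.e. 4th to 6th character => "064"
head = s[0, 6]   # start/length form: 6 characters from index 0 => "532064"
tail = s[-1]     # negative indexes count from the end => "f"

puts mid, head, tail
```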

Hope that helps

Note: I would add that the awk version currently has an issue: the second regex is greedy and will return more than required should '940e' exist more than once.
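
To make the greedy point concrete, here is a tiny Ruby comparison on a made-up string that contains '940e' twice (the string is not from the real data):

```ruby
# Made-up sample with two occurrences of "940e".
s = "059aa940e11940e22"

greedy = s[/059.*940e/]    # .* runs as far as it can  => "059aa940e11940e"
lazy   = s[/059.*?940e/]   # .*? stops at the first hit => "059aa940e"

puts greedy
puts lazy
```

The ruby one-liners in this thread already use the lazy form (.*?); the awk regex syntax has no lazy quantifier, so that version would need a different approach.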

Perseus · 08-30-2013, 01:02 PM

Hello grail,

The thing is, I've added the "y" variable trying to replicate your code, but it is not working.

Code:

ruby -ne 'BEGIN{$/="ff77"};$_.gsub!(/\n/,"");
	$_.split($/).each{|x| next unless 
	x =~ /(^.{6,18})(532064.{10})(814.{13})/;
	y =~ /(05)(9.{32,34})(.*?940e.{28})/;
	puts "#{$/} #{$1} #{$2} #{$3} #{$4} #{$5} #{$6}"
	}' 64bytes.txt

The regexes work, but I simply don't know where to add the back references for the second pattern (the $4, $5 and $6).

grail · 08-31-2013, 03:14 AM

Ok ... I see the issue.

1. The x being used contains the string we wish to search and is the block parameter |x|. Your y variable has never been assigned anything, so checking the regex against it will yield nothing

2. The back references are based on the last match found, which by the way is also where $& (the entire regex match) comes from. As y is empty there is no match, so there are no back references. I haven't tested it, so I am not sure whether they will just reference the last good match or be set to nil because the last attempt failed.

3. The back reference numbers only count groups within the same regex, i.e. if the first regex uses 3 groups and so does the second, both sets are referenced using $1 $2 $3 (hope that makes sense)
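
For points 2 and 3, a quick check on a made-up string (the names s, first, after_fail and renumbered are only for this example). In current Ruby a failed match clears the match data, so the numbered references come back as nil rather than keeping the values of the last good match:

```ruby
s = "abc123"

s =~ /([a-z]+)(\d+)/
first = [$1, $2]        # groups of the successful match => ["abc", "123"]

s =~ /(x)(y)/           # this match fails
after_fail = [$1, $2]   # match data is cleared => [nil, nil]

s =~ /(\d)/
renumbered = $1         # numbering restarts with every regex => "1"

p first, after_fail, renumbered
```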

See if the following helps, I have only done a couple of back references in second regex to show application:

Code:

ruby -ne 'BEGIN{$/="ff77"};
          $_.gsub!(/\n/,"");
          $_.split($/).each{
             |x| 
             next unless x =~ /^.{6,18}532064.{10}814.{13}/;
             puts "#{$/}#{$&}#{(x =~ /(05)(9.{32,34}).*?940e.{28}/?"#{$1} #{$2}":"")}"
          }' file

Perseus · 08-31-2013, 04:16 AM

Hello grail,

Great, it works! Now I know how to separate the sub-patterns for regex1 and regex2 with the code below.

Code:

ruby -ne 'BEGIN{$/="ff77"};
          $_.gsub!(/\n/,"");
          $_.split($/).each{
             |x| 
             next unless x =~ /(^.{6,18})(532064.{10})(504.{13})/;
              puts "#{$/} #{$1} #{$2} #{$3} #{(x =~ /(05)(9.{32,34}).*?940e.{28}/?"#{$1} #{$2}":"")}"
          }' file

Now, for example, is it possible to print #{$1} of regex1 in decimal format (without leading zeros), and #{$2} and #{$3} of regex1 without the f's? Or to replace #{$2} of regex2 with some string?

Once I know how, I'll be able to apply it to some sub-patterns of regex2 too. If it is too complicated, I'll need to think about a second script to apply to the output given by this ruby script.

PS: All of you have helped me a lot with this, but I was wondering whether, in ruby for example, regex1 and regex2 could be applied to the binary file directly to extract the same patterns? Right now I first need to dump the binary with the "xxd" command and then apply the script to the xxd dump, and with a large 4GB file the task is rather slow. If the xxd dump step could be avoided, that would be great; if not, I'll keep doing it this way, which works fine.

Thanks so much again!

grail · 08-31-2013, 05:44 AM

I see you have 2 options:

1. You can pipe the output of xxd directly into the script, so no temporary file is required

2. Ruby can indeed read binary, see here for more details
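
A minimal sketch of option 2, assuming the goal is to get the same run of lowercase hex digits the regexes already expect. The helper name bytes_to_hex and the sample bytes are made up for illustration:

```ruby
# Hypothetical helper: convert raw bytes to a lowercase hex string.
# "H*" tells String#unpack to emit hex digits, high nibble first.
def bytes_to_hex(bytes)
  bytes.unpack("H*").first
end

# Five sample bytes standing in for data read from the binary file.
sample = [0xff, 0x77, 0x00, 0x00, 0x01].pack("C*")
hex = bytes_to_hex(sample)
puts hex   # ff77000001
```

For a real file the bytes could come from File.binread, or from File.open(path, "rb") with read(size) in a loop to avoid loading 4GB at once. In the chunked case a record could straddle two chunks, so some carry-over logic would be needed; that part is not shown here.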

As for extraction and changes:

1. Ruby supports substrings as mentioned previously

2. It has the same gsub and sub functions as awk (gsub already being used)

3. Ruby has a printf function that gives the same kind of formatted output as printf in awk and other languages, e.g. to print a number with a given number of decimal places
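
Putting those three together on the fields of the first sample record, as a rough sketch. It assumes the counter field is hexadecimal (swap hex for to_i if it is plain decimal), and the names seq, id and num are invented for the example:

```ruby
# The three pieces captured by the first regex for record 1.
seq, id, num = "000001", "532064022272619f", "81422060001fffff"

line = format("%d %s %s",
              seq.hex,               # hex string to integer, drops leading zeros
              id.delete("f"),        # remove every "f"
              num.sub(/f+\z/, ""))   # remove only the trailing run of f's

puts line   # 1 532064022272619 81422060001
```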

You may also find the main page that the above link comes from helpful: http://www.ruby-doc.org/core-2.0.0/

Should you be using an older version of Ruby, simply change the digits after core to reflect the version

Perseus · 08-31-2013, 07:21 PM

Hello grail,

Thanks for your suggestions. I think reading directly from the binary could be more difficult, so
I've continued adding things to your code. I've been able to remove the f's by modifying the
regex, but I haven't managed to print as decimal yet.

What I need help with is how to assign back reference #1 to a variable, so that I can
pass it to printf afterwards (the var= and printf lines below).

And how to replace 2 characters (91) within the strings matched by regex2 with "Product1", for example.

Code:

ruby -ne 'BEGIN{$/="ff77"};
    $_.gsub!(/\n/,"");
    $_.split($/).each{
        |x| 
		next unless x =~ /(^.{6,18})(532064.{9}).(814(\d){1,13})/;
		var=#{$1}
		printf("%d ","$var")

        puts "#{$/} #{$1} #{$2} #{$3}\		
		#{(x =~ /(05)(9[0-9])([1-9][0-9a-f]|0[e-f]0[1-9a].{26,28}0[0-1]8.*?)(940e)(.{28})/?\
		" #{$1} #{$2}#{$3} #{$4} #{$5}":"")}"
        }' file

I hope you can shed some light on how to do this.

Thanks for the help.

Perseus · 09-01-2013, 04:26 AM

Hello grail,

I've been reading here and there about ruby, and after a lot of trying
I've been able to remove the f's and print as decimal, and I almost have the output I want.

I don't really understand how the "next unless" you use works. I tried replacing it with an "if" statement and that works. Is the "if" condition slower than "next unless", or is it the same?
I switched to "if" because "next unless" wasn't working for me once I added the extra code.

This is the code I have so far:

Code:

ruby -ne 'BEGIN{$/="ff77"};
    $_.gsub!(/\n/,"");
    $_.split($/).each{
        |x| 
		if x =~ /(^.{6,18})(532064.{9}).(814(\d){1,13})/;
			printf("%d %s %s","0x"+$1,$2,$3)
			if x =~ /(05)(9[0-9])([1-9][0-9a-f]|0[e-f]0[1-9a].{26,28}0[0-1]8.*?)(940e)(.{28})/;
				str=$2+$3+$4+$5
				map={/91../ => " PROD_1 ",/92../ => " PROD_2 ",/93../ => "PROD_3"}
				map.each {|k,v| str.sub!(k,v)}
				printf(" %s\n",str)
			else
				printf("\n")
			end
		end		
        }' file

I hope to be able to complete the desired output; if not, I'll come back to ask for help.

Thanks for all the help to all.

Best regards

pan64 · 09-01-2013, 07:56 AM

Quote:

Originally Posted by grail

@pan64 - I did like the tr and paste methods, I am curious though how applying this to a 4GB file would affect performance?

It probably depends on the performance of the /tmp filesystem and the related resources. Otherwise it should be ok.
On the other hand, your script will not work if ff77 is split across two lines.

grail · 09-01-2013, 11:24 AM

Quote:

On the other hand, your script will not work if ff77 is split across two lines.

Actually, the new code specifically targets the fact that ff77 may be split over multiple lines. A record read using ff77 as the separator may still contain a marker that was broken by a newline;
once the newlines are removed, we split on ff77 and process each portion accordingly.
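
A small stand-alone sketch of that behaviour, using made-up data in which the second marker is broken by a newline. StringIO stands in for the input file, and the payloads AAAA, BBBB and CCCC are placeholders:

```ruby
require "stringio"

# Two records; the second "ff77" marker is broken by a newline ("ff\n77").
data = "ff77AAAA\nBBBBff\n77CCCC\n"

pieces = []
StringIO.new(data).each_line("ff77") do |chunk|
  cleaned = chunk.delete("\n")              # joins "ff\n77" back into "ff77"
  cleaned.split("ff77").each do |part|      # second pass catches the repaired marker
    pieces << part unless part.empty?
  end
end

p pieces   # ["AAAABBBB", "CCCC"]
```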

@OP

unless is the opposite of if, just as until is the opposite of while: one tests for a positive to enter and the other for a negative.
The initial reason for the unless was to skip lines / entries that do not match the first regex.
From a performance standpoint, I guess the only real difference is that once the unless condition holds we continue with the next iteration immediately, whereas when the if is found to be false
execution jumps to the 'end' and checks whether anything follows it before moving to the next iteration. I would imagine there is no directly visible impact, though, as both result
in the next iteration of the loop.
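
The two forms side by side, on a made-up list of records, just to show that they select the same entries:

```ruby
# Made-up records; only the first and third contain 532064.
records = ["000001532064", "zzz", "000002532064"]

# Form 1: skip early with "next unless".
with_next = []
records.each do |r|
  next unless r =~ /532064/   # non-matching records go straight to the next iteration
  with_next << r
end

# Form 2: wrap the body in "if ... end".
with_if = []
records.each do |r|
  if r =~ /532064/
    with_if << r
  end
end

p with_next == with_if   # true
```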

May I ask that you include an example of the output you would like to see from the input example used in post #13?

Note: Not sure how often you will need to run this type of thing, but as it is getting longer I would suggest moving away from the single line entry and placing it inside a script.
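
As a rough idea of what that could look like, here is a sketch of the earlier one-liner as a script file. The file name extract.rb, the constant names and the extract method are all made up for illustration, and the two regexes are the ones from the earlier posts:

```ruby
#!/usr/bin/env ruby
# Sketch of the one-liner as a script (hypothetical name: extract.rb).

MARKER = "ff77"
REGEX1 = /^(.{6,18})(532064.{10})(814.{13})/
REGEX2 = /059.{32,34}.*?940e.{28}/

# Reads records separated by MARKER from io and returns one
# output line for every record that matches REGEX1.
def extract(io)
  out = []
  io.each_line(MARKER) do |chunk|
    # Remove newlines, then split again in case a marker was broken by one.
    chunk.delete("\n").split(MARKER).each do |rec|
      m1 = REGEX1.match(rec)
      next unless m1
      m2 = REGEX2.match(rec)
      # compact drops the nil left behind when REGEX2 does not match.
      out << [MARKER, m1[1], m1[2], m1[3], m2 && m2[0]].compact.join(" ")
    end
  end
  out
end

# Run as: ruby extract.rb 64bytes.txt
if __FILE__ == $0 && ARGV[0]
  File.open(ARGV[0]) { |f| puts extract(f) }
end
```

Keeping the matching in a method that takes any IO object means it can be tried on a short string before running it over the full dump.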