[SOLVED] Get strings distributed along up to 3 lines
It works, but when I include 2 regexes I get an error saying that the -P option only works with one regex.
I'm not sure whether this could be handled directly with perl. Or is it possible for Perl to read directly
from the binary file and extract the same patterns given by Regex1 and Regex2?
Hello pan64,
It works with your awk code, but I don't know why it stops working when I include the regexes for the real file; it
prints nothing.
1- Since the end of line "\n" could appear anywhere, I've tried to modify the record separator you used, so
instead of $/="ff77" I've tried $/="f\n?f\n?7\n?7", but it seems that is not the way. What would be the correct way?
2- I want to print the patterns found in groups, so I've added () to try to use back references,
but I don't know where to insert (within the "puts" part) the \1, \2, \3, etc.
Unfortunately Ruby does not currently support a regex as the record separator (RS).
But I had a bit of a play; how about something like:
Code:
ruby -ne 'BEGIN{$/="ff77"};$_.gsub!(/\n/,"");$_.split($/).each{|x| next unless x =~ /^.{6,18}532064.{10}814.{13}/;puts "#{$/}#{$&}#{(x =~ /059.{32,34}.*?940e.{28}/?$&:"")}"}' file
So basically, after using ff77 as the RS, we remove all the '\n' and then check whether ff77 now appears again; if so, we split on it and run the same checks on each piece.
As for part 2, simply prefix the reference number with $, i.e. $1 instead of \1
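To make the $1 versus \1 point concrete, here is a minimal sketch. The helper name split_header and the sample record are made up for illustration; only the $&, $1 and $2 globals are standard Ruby.

```ruby
# Hypothetical helper: returns [whole_match, group1, group2],
# or nil when the record does not match.
def split_header(record)
  return nil unless record =~ /^(.{6})(532064)/
  # After a successful =~, Ruby fills $& (whole match) and $1, $2 (groups).
  [$&, $1, $2]
end

# Made-up sample record, not taken from the real dump.
p split_header("0012345320640000000000814")
# prints ["001234532064", "001234", "532064"]

# \1 is still the right spelling inside a replacement string:
puts "ff77aa".sub(/(ff)(77)/, '\2\1')   # prints 77ffaa
```

So inside "#{...}" in the puts you would write $1, $2 and so on, while \1 is only for use inside the regex itself or inside a sub/gsub replacement string.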
Both scripts work just fine, but for the part explained below I am only able to modify the ruby script.
My final goal is to print some characters separately, so for this purpose I've inserted back references following
grail's suggestion. It works and prints correctly, but I don't know how to do the same back referencing for
regex2 in order to print some characters of pattern2 separately too.
Once I am able to print substrings, the last pending issue for me is to manipulate the substrings given by the
back references so that some are printed in decimal format and others have the f's removed:
The backreferencing is the same in either portion. Your options are either:
1. Alter your regexes to give the level of granularity you are looking for
2. Manipulate the individual returned pieces (i.e. substrings, which in ruby are taken with array-style indexing on the string, e.g. a[3..5] will return the 4th to 6th characters, as counting starts at 0)
Hope that helps
Note: I would add that the awk currently has an issue in it: the second regex is greedy and will return more than required should '940e' exist more than once.
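A small sketch of the greedy issue, written in Ruby rather than awk and using a made-up string in which '940e' appears twice. The method names are only for illustration.

```ruby
# Greedy: .* runs to the LAST 940e in the string.
def greedy_match(s)
  s[/059.*940e/]
end

# Lazy: .*? stops at the FIRST 940e after 059.
def lazy_match(s)
  s[/059.*?940e/]
end

sample = "059aaa940ebbb940eccc"   # hypothetical data
puts greedy_match(sample)   # prints 059aaa940ebbb940e
puts lazy_match(sample)     # prints 059aaa940e
```

The ruby one-liner already uses the lazy form (.*?), which is why it does not over-match in the same way.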
1. The x being used contains the string we wish to search and is associated with |x|; your y variable is actually empty, so checking the regex against it will yield nothing
2. The back references are based on the last match found, which by the way is also where $& gets the entire regex match from. As y is empty there are no matches, and so there are no back references. Now I haven't tested it, so I am not sure if they will just reference the last good match or be set to nil as there were no matches in the last attempt.
3. The back reference numbers only count up within a single regex, i.e. if the first regex uses 3 groups and so does the second, both sets are referenced using $1 $2 $3 (hope that makes sense)
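For points 2 and 3, a quick sketch with made-up data (backref_demo is a hypothetical name). In Ruby a failed match resets the match data, so $1 and friends become nil rather than keeping the values from the last good match.

```ruby
# Hypothetical demo: run three matches in a row and record $1/$2 after each.
def backref_demo(x)
  x =~ /(aa)(11)/
  first = [$1, $2]          # groups of the first regex

  x =~ /(bb)(22)/
  second = [$1, $2]         # same names, now the groups of the second regex

  x =~ /(zz)/               # this one does not match
  after_failure = [$1, $2]  # both nil: a failed match clears the references

  [first, second, after_failure]
end

p backref_demo("aa11bb22")
# prints [["aa", "11"], ["bb", "22"], [nil, nil]]
```

So if you need values from regex1 after running regex2, copy them into ordinary variables first.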
See if the following helps, I have only done a couple of back references in second regex to show application:
Code:
ruby -ne 'BEGIN{$/="ff77"};
$_.gsub!(/\n/,"");
$_.split($/).each{
|x|
next unless x =~ /^.{6,18}532064.{10}814.{13}/;
puts "#{$/}#{$&}#{(x =~ /(05)(9.{32,34}).*?940e.{28}/?"#{$1} #{$2}":"")}"
}' file
Now for example, is it possible to print #{$1} of regex1 in decimal format (without leading zeros), and #{$2} and #{$3} of regex1 without the f's? Or to replace #{$2} of regex2 with some string?
I would like to know so that I can apply it to some sub-patterns of regex2 too. If it is too complicated, I'll need to think about a second script that post-processes the output of this ruby script.
PS: All of you have helped me a lot with this, but I was wondering whether, in ruby for example, regex1 and regex2 could be applied while reading the binary file directly to extract the same patterns? Right now I first need to dump the binary with the "xxd" command and then apply the script to the xxd dump, and with a large 4GB file the task is rather slow. If the xxd dump step could be avoided, that would be great; if not, I'll keep doing it this way, which works fine.
1. You can pipe the output xxd directly into the script so no temporary file required
2. Ruby can indeed read binary, see here for more details
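As a rough sketch of point 2, assuming the goal is to get the same lowercase hex text without running xxd first: String#unpack1 with "H*" turns raw bytes into hex digits. The helper names and the chunk size below are my own choices, not anything from the thread.

```ruby
# Convert raw bytes to lowercase hex text, two digits per byte.
def hex_of(bytes)
  bytes.unpack1("H*")
end

# Hypothetical helper: read a file in binary mode, chunk by chunk,
# and yield each chunk as hex text so a 4GB file is never held in memory.
# Caveat: a pattern that straddles two chunks needs extra handling
# (for example carrying the tail of one chunk over to the next).
def each_hex_chunk(path, chunk_size = 1 << 20)
  File.open(path, "rb") do |f|
    while (chunk = f.read(chunk_size))
      yield hex_of(chunk)
    end
  end
end

puts hex_of("\xff\x77\x05".b)   # prints ff7705
```

Because the hex text produced this way contains no newlines, the "\n" removal step from the one-liner would not be needed on this path.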
As for extraction and changes:
1. Ruby supports substrings as mentioned previously
2. It has the same gsub and sub functions as awk (gsub already being used)
3. Ruby has a printf function that gives the same kind of output as printf in awk and other languages, e.g. to print a number as a decimal or to a set number of decimal places
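The three points above can be sketched together as follows. The helper name describe and all sample values are hypothetical; hex_count and padded simply stand in for whatever $1 and $2 hold after a match.

```ruby
# Hypothetical helper: turn one hex field into decimal and strip f's from another.
def describe(hex_count, padded)
  count  = hex_count.to_i(16)      # hex text to integer, leading zeros disappear
  digits = padded.gsub("f", "")    # like awk's gsub: remove every f
  format("%d %s", count, digits)   # format/printf accept C-style directives
end

puts describe("0001a", "12345fff")   # prints 26 12345

# Substring by index range: characters 4 to 6, counting from 0.
puts "abcdef"[3..5]                  # prints def

# sub replaces only the first occurrence of the pattern.
puts "0591aabb".sub(/91../, "PROD_1")   # prints 05PROD_1bb
```

printf(...) prints directly, while format(...) returns the string, which is handy when you want to build the line first and print it later.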
Thanks for your suggestions. I think reading directly from the binary could be more difficult, so
I've continued adding some things to your code. I've been able to remove the f's by modifying the
regex, but I haven't managed to print as decimal yet.
What I need help with is how to assign back reference #1 to a variable, so that I can
insert it into printf later (the lines in red).
And how to replace 2 characters (91) within the strings matched by regex-2 with "Product1", for example.
I've been reading here and there about ruby, and after a lot of trial and error
I've been able to remove the f's and print as decimal, and I almost have the output I want.
I don't fully understand how the "next unless" you use works. I've tried replacing it with an "if" statement and it works. Is the "if" condition slower than "next unless", or is it the same?
I switched to "if" because "next unless" wasn't working for me with the code I added.
This is the code I have so far:
Code:
ruby -ne 'BEGIN{$/="ff77"};
$_.gsub!(/\n/,"");
$_.split($/).each{
|x|
if x =~ /(^.{6,18})(532064.{9}).(814(\d){1,13})/;
printf("%d %s %s","0x"+$1,$2,$3)
if x =~ /(05)(9[0-9])([1-9][0-9a-f]|0[e-f]0[1-9a].{26,28}0[0-1]8.*?)(940e)(.{28})/;
str=$2+$3+$4+$5
map={/91../ => " PROD_1 ",/92../ => " PROD_2 ",/93../ => "PROD_3"}
map.each {|k,v| str.sub!(k,v)}
printf(" %s\n",str)
else
printf("\n")
end
end
}' file
I hope be able to complete the desired output and if not I'll come back to request help
@pan64 - I did like the tr and paste methods, I am curious though how applying this to a 4GB file would affect performance?
It probably depends on the performance of the /tmp filesystem, and the related resources. Otherwise it should be ok.
On the other hand, your script will not work if ff77 is split across two lines.
Actually the new code specifically targets the fact that ff77 may be split over multiple lines. Each record the script reads
is a portion of the input that was originally between 2 literal ff77's; once the newlines are removed we split on ff77 again and process each portion accordingly.
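Here is a small sketch of that mechanism outside the one-liner, using StringIO in place of the real file and a made-up dump in which the second ff77 is broken by a newline. The method name records is hypothetical.

```ruby
require "stringio"

# Hypothetical helper: read records separated by a literal sep, strip the
# newlines, then split again to catch separators that were broken by "\n".
def records(io, sep = "ff77")
  out = []
  io.each_line(sep) do |chunk|
    chunk.delete("\n").split(sep).each do |piece|
      out << piece unless piece.empty?
    end
  end
  out
end

# The second separator appears as "f\nf77", so each_line does not see it.
dump = "aaaff77bbbf\nf77ccc\nff77ddd"
p records(StringIO.new(dump))
# prints ["aaa", "bbb", "ccc", "ddd"]
```

The first pass returns "bbbf\nf77ccc\nff77" as one record; only after the newlines are deleted does the hidden ff77 appear, and the second split separates "bbb" from "ccc".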
@OP
unless is the opposite of if, just as until is the opposite of while. So one tests for a positive to enter and the other for a negative.
The initial reason for the unless was to skip lines / entries that do not contain the first regex.
From a performance standpoint I guess the only real difference is that once the unless condition is true we continue with the next loop immediately, whereas when the if is found to be false
it will jump to the 'end' to see if anything is after it before continuing with the next loop. I would imagine there should be no directly visible impact from this though, as they both result
in the next loop being iterated over.
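To show the two forms side by side, a minimal sketch with made-up input; the method names and the /^ab/ pattern are placeholders for the real regex1 check.

```ruby
# Guard style: skip the entry early, the rest of the block stays flat.
def with_next(items)
  out = []
  items.each do |x|
    next unless x =~ /^ab/
    out << x.upcase
  end
  out
end

# Nested style: the work lives inside the if ... end.
def with_if(items)
  out = []
  items.each do |x|
    if x =~ /^ab/
      out << x.upcase
    end
  end
  out
end

sample = ["abc", "xyz", "abd"]
p with_next(sample)   # prints ["ABC", "ABD"]
p with_if(sample)     # prints ["ABC", "ABD"]
```

One thing to watch when mixing the styles: anything placed after a "next unless" line is skipped for non-matching entries, so code that must run for every entry has to come before it.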
May I ask that you include an example of the output you would like to see from the input example used in post #13?
Note: Not sure how often you will need to run this type of thing, but as it is getting longer I would suggest moving away from the single-line entry and placing it inside a script.
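As a rough idea of what that script could look like, here is a sketch built from the earlier one-liner. The file layout, the constant names and the helper names are my own; the two regexes are copied from the one-liner, and the test on ARGV is only there so the file does nothing when loaded without a file argument.

```ruby
#!/usr/bin/env ruby
# Sketch of the one-liner as a script; names here are illustrative only.

SEP    = "ff77"
REGEX1 = /^.{6,18}532064.{10}814.{13}/
REGEX2 = /059.{32,34}.*?940e.{28}/

# Returns the output line for one piece, or nil if regex1 does not match.
def format_piece(piece)
  return nil unless piece =~ REGEX1
  first  = $&                    # keep regex1's match before matching again
  second = piece[REGEX2] || ""   # regex2 is optional, as in the one-liner
  "#{SEP}#{first}#{second}"
end

# Reads SEP-separated records from io and writes matching lines to out.
def process(io, out)
  io.each_line(SEP) do |chunk|
    chunk.delete("\n").split(SEP).each do |piece|
      line = format_piece(piece)
      out.puts(line) if line
    end
  end
end

# Run only when called as a script with at least one file argument.
process(ARGF, $stdout) if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
```

Saved under some name of your choosing and made executable, it would be run with the dump file as its argument, and each regex then sits on its own line where it is much easier to edit.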