[SOLVED] SED search and replace fields in a fixed position based on a condition.
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
SED search and replace fields in a fixed position based on a condition.
Hi,
I am having this super complicated problem that I hope someone will be able to shed some light on.
I have many files (>1MB each) containing millions of records.
Each record have a fixed number of characters (e.g. 50 length), with each field having a fixed position.
Quote:
E.g. 4 records as below:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4
The description of the fields are as follows:
Pos. 1-9 ID
Pos. 11-18 Date
Pos. 20-39 Phone
Pos. 41-46 Time
Pos. 48-50 Checksum
(Somehow the trailing spaces did not get displayed corrected after the 3rd field. Please see this link on how it should look like.)
I would like to remove the subsequent string that contains any alphabet in the Phone field. In the example shown above, the 2nd record's '90354B32 ' (with the trailing spaces) will be changed to '90354 ' instead.
I.e. If the 3rd field has not alphabet, then the line should remain intact. If that field contains alphabet, the alphabet and the subsequent characters should be replaced with spaces.
The output should be as follows:
Quote:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354 101011 2A2
233450214 20111231 9035433 101012 2A3
233450215 20111231 90354331 101013 2A4
(Somehow the trailing spaces did not get displayed corrected after the 3rd field. Please see this 2nd link on how it should look like.)
I have searched everywhere but I can either find out how to search and replace based on the position only, or search and replace based on matching values. But this is a combination which I can't seems to find any solution at all.
The main issue is that the files must be processed in an efficient way hence i think the best way forward is a combination of 'sed' & 'awk' commands.
I thought the below code should be able to print out the 2 records that contain the fields with alphabet.
If you have access to a reasonably modern version of gawk, you can also use the FIELDWIDTHS variable to split the line according to fixed column positions.
I added the semicolons, so you can cram the entire thing on one single line if you want.
On the first line, c defines the first column in the desired field (first column being column 1), and n is the number of characters in the column. If your file contains non-ASCII characters, you need to use a matching locale: define LANG and LC_ALL environment variables accordingly. At least GNU awk will then calculate characters and not bytes.
The BEGIN rule sets the record separator to any newline convention. It will set the field separator to a newline character, so awk will not split the records into fields. GNU awk (gawk) will set RT to the string that matched the record separator for each record; the snippet uses it to retain whatever newlines you use. Since other awk variants do not provide RT, it sets it to UNIX newline, so that they'll work too, just use \n newlines in the output.
The main logic is in the default rule. s is set to the complete record. This is an optimization; if we modified $0 directly, awk would every time see if it needs to be resplit, wasting CPU time. i will contain the index of the first letter within the field, or 0 if the field does not contain letters.
If your input may contain non-ASCII letters, you might wish to use a different pattern, for example /[^0-9]/ to look for any non-digit. Or /[^-+0-9 ]/ to accept digits, space, plus + and minus -, but nothing else. In principle, it is always better to check if the string contains only acceptable characters, rather than to check for unacceptable characters. You can always miss some, after all.
If the field contains a letter, then the entire record is reconstructed. The first substr() retains everything before the current field, and the current field before the match. The second substr() adds the proper number of spaces, and the third retains everything after the field.
The command does not rely on spaces or field separators, only on c and n .
If you use GNU awk (gawk), you'll retain the newline convention. Any newline convention is accepted in the input by all awk variants, but other awk variants will convert the newlines to UNIX newlines ("\n") in the output.
grail's code is short and sweet and it worked. But there are actually lots of trailing spaces in the rest of the text file that is trimmed. I am not sure how to append back the spaces based on Cedrik's example.
I tried commands like
Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",sprintf("%-8s",$3))}1'
or
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3=sprintf("%-8s",$3))}1'
and got syntax error.
But I forgot to add that there are actually other fields after the 5 fields and there are lots of trailing spaces everywhere.
Anyway, Nominal Animal's code is fantastic! Although a tat long, it solves the issue 100%. And thanks for making the effort to explain the code too. Really appreciate it!
Gosh, I wish you had. You might discover that
(1) Some problems have more than one solution, and
(2) With huge files you may find one of those solutions runs *much* faster than the others.
Technical intuition leads me to suspect something which has not been mentioned. The delimiter following the third data field is not a blank, it is a tab character. If this is the case it can work to our advantage. Try this pipe:
Code:
" cat <" InFile ,
"| sed 's/[A-Z].*\t/ /' " ,
"| sed 's/\t/ /' "
">" OutFile
You may discover that it has good performance and also handles the data fields and trailing blanks which were not mentioned in your original post.
Daniel B. Martin
Last edited by danielbmartin; 01-06-2012 at 08:02 AM.
Reason: Clarity
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.