[SOLVED] SED search and replace fields in a fixed position based on a condition.

jfkse7en · 01-05-2012, 04:54 AM

Hi,

I am having this super complicated problem that I hope someone will be able to shed some light on.

I have many files (>1MB each) containing millions of records.
Each record have a fixed number of characters (e.g. 50 length), with each field having a fixed position.

Quote:

E.g. 4 records as below:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4

The description of the fields are as follows:
Pos. 1-9 ID
Pos. 11-18 Date
Pos. 20-39 Phone
Pos. 41-46 Time
Pos. 48-50 Checksum
(Somehow the trailing spaces did not get displayed corrected after the 3rd field. Please see this link on how it should look like.)

I would like to remove the subsequent string that contains any alphabet in the Phone field. In the example shown above, the 2nd record's '90354B32 ' (with the trailing spaces) will be changed to '90354 ' instead.
I.e. If the 3rd field has not alphabet, then the line should remain intact. If that field contains alphabet, the alphabet and the subsequent characters should be replaced with spaces.

The output should be as follows:

Quote:

233450212 20111230 90354332 101010 2A1
233450213 20111230 90354 101011 2A2
233450214 20111231 9035433 101012 2A3
233450215 20111231 90354331 101013 2A4
(Somehow the trailing spaces did not get displayed corrected after the 3rd field. Please see this 2nd link on how it should look like.)

I have searched everywhere but I can either find out how to search and replace based on the position only, or search and replace based on matching values. But this is a combination which I can't seems to find any solution at all.

The main issue is that the files must be processed in an efficient way hence i think the best way forward is a combination of 'sed' & 'awk' commands.

I thought the below code should be able to print out the 2 records that contain the fields with alphabet.

Code:

sed "/^(.{20})(.{20})[A-Z]/ p"

But it doesn't seemed to work.

Thanks in advance!

grail · 01-05-2012, 09:50 AM

If you use code instead of quote tags your formatting will remain.

As to the problem, try using awk and work on the third field, something like:

Code:

awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3)}1' file

Cedrik · 01-05-2012, 10:10 AM

You could add

Code:

$3=sprintf("%-8s", $3);

... after the gsub expression, in grail's code, to right-pad $3
(assuming field's fixed length is 8 chars)

David the H. · 01-05-2012, 05:42 PM

If you have access to a reasonably modern version of gawk, you can also use the FIELDWIDTHS variable to split the line according to fixed column positions.

http://www.gnu.org/software/gawk/man...tant-Size.html

It may be useful if any of the fields themselves could contain whitespace.

Nominal Animal · 01-05-2012, 07:58 PM

How about

Code:

awk -v c=20 -v n=9 'BEGIN { RS = "(\n|\r|\r\n|\n\r)"; FS = "[\n\r]"; RT = "\n";
                            sp = "                ";
                            while (length(sp) < n) sp = sp sp;
                            sp = substr(sp, 1, n);
                          }
                    { s = $0;
                      i = match(substr(s, c, n), /[A-Za-z]/);
                      if (i > 0) s = substr(s, 1, c+i-2) substr(sp, 1, n-i+1) substr(s, c+n);
                      printf("%s%s", s, RT);
                    }' input-file > output-file

I added the semicolons, so you can cram the entire thing on one single line if you want.

On the first line, c defines the first column in the desired field (first column being column 1), and n is the number of characters in the column. If your file contains non-ASCII characters, you need to use a matching locale: define LANG and LC_ALL environment variables accordingly. At least GNU awk will then calculate characters and not bytes.

The BEGIN rule sets the record separator to any newline convention. It will set the field separator to a newline character, so awk will not split the records into fields. GNU awk (gawk) will set RT to the string that matched the record separator for each record; the snippet uses it to retain whatever newlines you use. Since other awk variants do not provide RT, it sets it to UNIX newline, so that they'll work too, just use \n newlines in the output.

The main logic is in the default rule. s is set to the complete record. This is an optimization; if we modified $0 directly, awk would every time see if it needs to be resplit, wasting CPU time. i will contain the index of the first letter within the field, or 0 if the field does not contain letters.

If your input may contain non-ASCII letters, you might wish to use a different pattern, for example /[^0-9]/ to look for any non-digit. Or /[^-+0-9 ]/ to accept digits, space, plus + and minus -, but nothing else. In principle, it is always better to check if the string contains only acceptable characters, rather than to check for unacceptable characters. You can always miss some, after all.

If the field contains a letter, then the entire record is reconstructed. The first substr() retains everything before the current field, and the current field before the match. The second substr() adds the proper number of spaces, and the third retains everything after the field.

Given this input,

Code:

233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4

the command above will yield

Code:

233450212 20111230 90354332 101010 2A1
233450213 20111230 90354    101011 2A2
233450214 20111231 9035433  101012 2A3
233450215 20111231 90354331 101013 2A4

The command does not rely on spaces or field separators, only on c and n .

If you use GNU awk (gawk), you'll retain the newline convention. Any newline convention is accepted in the input by all awk variants, but other awk variants will convert the newlines to UNIX newlines ("\n") in the output.

Hope this helps,

danielbmartin · 01-05-2012, 08:41 PM

Warning: I am a newbie. I don't know awk (yet) and always prefer to avoid explicit loops. Here's my proposed solution.

Code:

  "  cut -c1-27 <" InFile ,
  "| sed 's/[A-Z].*/       /g' " ,
  "| cut -c1-27" ,
  ">" Work1

  "  cut -c29- <" InFile ,
  ">" Work2

  "  paste -d' ' " Work1 Work2 ,
  ">" OutFile

Daniel B. Martin

jfkse7en · 01-06-2012, 12:34 AM

Hi all,

Thanks! You guys are really great!

Noted on the [code] thingy.

grail's code is short and sweet and it worked. But there are actually lots of trailing spaces in the rest of the text file that is trimmed. I am not sure how to append back the spaces based on Cedrik's example.
I tried commands like

Code:

awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",sprintf("%-8s",$3))}1'
or
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3=sprintf("%-8s",$3))}1'

and got syntax error.

But I forgot to add that there are actually other fields after the 5 fields and there are lots of trailing spaces everywhere.

Anyway, Nominal Animal's code is fantastic! Although a tat long, it solves the issue 100%. And thanks for making the effort to explain the code too. Really appreciate it!

Sorry Daniel, didn't try out your code.

Cheers!

danielbmartin · 01-06-2012, 06:14 AM

Quote:

Originally Posted by jfkse7en

Sorry Daniel, didn't try out your code.

Gosh, I wish you had. You might discover that
(1) Some problems have more than one solution, and
(2) With huge files you may find one of those solutions runs *much* faster than the others.

Technical intuition leads me to suspect something which has not been mentioned. The delimiter following the third data field is not a blank, it is a tab character. If this is the case it can work to our advantage. Try this pipe:

Code:

  "  cat <" InFile ,
  "| sed 's/[A-Z].*\t/ /' " ,
  "| sed 's/\t/ /' "
  ">" OutFile

You may discover that it has good performance and also handles the data fields and trailing blanks which were not mentioned in your original post.

Daniel B. Martin