SED search and replace fields in a fixed position based on a condition.
Hi,
I am having this super complicated problem that I hope someone will be able to shed some light on. I have many files (>1MB each) containing millions of records. Each record has a fixed number of characters (e.g. 50), with each field at a fixed position. Quote:
I.e. if the 3rd field contains no letters, the line should remain intact. If the field contains a letter, that letter and the subsequent characters in the field should be replaced with spaces. The output should be as follows: Quote:
The main issue is that the files must be processed efficiently, so I think the best way forward is a combination of 'sed' and 'awk' commands. I thought the code below would print the 2 records whose fields contain letters. Code:
sed "/^(.{20})(.{20})[A-Z]/ p"
Thanks in advance!
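For what it's worth, a likely reason the command above does not do what was expected: without -E, sed treats ( ) and { } as literal characters, and without -n every line is printed whether it matches or not. Here is a minimal sketch of a filter that prints only the records with a letter in the third field. It assumes the third field occupies columns 20-28 (inferred from the sample record, not confirmed by the poster), and the function name print_letter_records is made up for illustration.

```shell
# Print only records whose third field (assumed: columns 20-28) contains a letter.
print_letter_records() {
    # -n : do not print lines unless told to
    # -E : extended regular expressions, so {19} and {0,8} are repetition counts
    # Skip 19 characters, then allow up to 8 non-letters before the first letter,
    # which keeps the match inside the 9-character field.
    sed -n -E '/^.{19}[^A-Za-z]{0,8}[A-Za-z]/p' "$@"
}
```

Usage would be something like print_letter_records InFile. This only selects the records; it does not blank anything out.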
If you use code instead of quote tags your formatting will remain.
As to the problem, try using awk and work on the third field, something like: Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3)}1' file
You could add
Code:
$3=sprintf("%-8s", $3);
(assuming the field's fixed length is 8 chars)
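Putting the two suggestions together might look like the sketch below. It assumes whitespace-separated fields and an 8-character third field, and the function name blank_field3 is made up. One caveat: assigning to $3 makes awk rebuild the whole record with single spaces between fields, so runs of spaces elsewhere in the line and any trailing spaces are lost.

```shell
# Blank out the third field from its first letter onward, then pad it back
# to 8 characters. Assumes whitespace-separated fields.
blank_field3() {
    awk '$3 ~ /[A-Za-z]/ {
        sub(/[A-Za-z].*/, "", $3)     # drop the first letter and everything after it
        $3 = sprintf("%-8s", $3)      # left-justify and pad with spaces to width 8
    }
    1' "$@"                           # 1 is an always-true pattern: print every record
}
```

Note that sprintf() returns a new string, so its result has to be assigned back to $3; it cannot be used as the target argument of sub() or gsub().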
If you have access to a reasonably modern version of gawk, you can also use the FIELDWIDTHS variable to split the line according to fixed column positions.
http://www.gnu.org/software/gawk/man...tant-Size.html
It may be useful if any of the fields themselves could contain whitespace.
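A rough sketch of the FIELDWIDTHS idea, for gawk only. The widths are assumptions based on the sample record: 19 characters before the third field, 9 for the field itself, and 1000 as a placeholder that should be at least as long as the rest of the record. The function name blank_fixed_field is made up.

```shell
# gawk only: split each record by column widths instead of by whitespace.
blank_fixed_field() {
    gawk 'BEGIN {
        FIELDWIDTHS = "19 9 1000"   # prefix, target field, rest of the record
        OFS = ""                    # rebuild the record without adding separators
    }
    match($2, /[A-Za-z]/) {
        # keep the part before the first letter, pad with spaces to width 9
        $2 = sprintf("%-9s", substr($2, 1, RSTART - 1))
    }
    1' "$@"
}
```

Because the pieces are rejoined with an empty OFS, the spacing outside the target field, including trailing spaces, should come through unchanged.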
How about
Code:
awk -v c=20 -v n=9 'BEGIN { RS = "(\n|\r|\r\n|\n\r)"; FS = "[\n\r]"; RT = "\n";
On the first line, c defines the first column in the desired field (first column being column 1), and n is the number of characters in the column.
If your file contains non-ASCII characters, you need to use a matching locale: define LANG and LC_ALL environment variables accordingly. At least GNU awk will then calculate characters and not bytes.
The BEGIN rule sets the record separator to any newline convention. It will set the field separator to a newline character, so awk will not split the records into fields. GNU awk (gawk) will set RT to the string that matched the record separator for each record; the snippet uses it to retain whatever newlines you use. Since other awk variants do not provide RT, it sets it to UNIX newline, so that they'll work too; they just use \n newlines in the output.
The main logic is in the default rule. s is set to the complete record. This is an optimization; if we modified $0 directly, awk would every time see if it needs to be resplit, wasting CPU time.
i will contain the index of the first letter within the field, or 0 if the field does not contain letters. If your input may contain non-ASCII letters, you might wish to use a different pattern, for example /[^0-9]/ to look for any non-digit. Or /[^-+0-9 ]/ to accept digits, space, plus + and minus -, but nothing else. In principle, it is always better to check if the string contains only acceptable characters, rather than to check for unacceptable characters. You can always miss some, after all.
If the field contains a letter, then the entire record is reconstructed. The first substr() retains everything before the current field, and the current field before the match. The second substr() adds the proper number of spaces, and the third retains everything after the field.
Given this input, Code:
233450212 20111230 90354332 101010 2A1
Code:
233450212 20111230 90354332 101010 2A1
If you use GNU awk (gawk), you'll retain the newline convention. Any newline convention is accepted in the input by all awk variants, but other awk variants will convert the newlines to UNIX newlines ("\n") in the output.
Hope this helps,
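The main rule described above can be sketched roughly as follows. This is a reconstruction from the description, not the original code: it keeps the variables c, n, s and i, but leaves out the RS/RT newline handling and uses sprintf() to produce the run of spaces. The function name blank_from_letter is made up.

```shell
# Blank out a fixed-position field from its first letter to the end of the field.
# c = first column of the field (1-based), n = width of the field.
blank_from_letter() {
    awk -v c=20 -v n=9 '
    {
        s = $0                                  # work on a copy, $0 is never modified
        i = match(substr(s, c, n), /[A-Za-z]/)  # index of first letter in the field, or 0
        if (i > 0) {
            pad = sprintf("%" (n - i + 1) "s", "")   # spaces for the rest of the field
            s = substr(s, 1, c + i - 2) pad substr(s, c + n)
        }
        print s
    }' "$@"
}
```

Since the record is only ever cut by column position, spacing elsewhere in the line, including trailing spaces, is left alone. To reject anything that is not a digit rather than looking for letters, the pattern could be swapped for /[^0-9 ]/ as suggested above.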
Warning: I am a newbie. I don't know awk (yet) and always prefer to avoid explicit loops. Here's my proposed solution.
Code:
" cut -c1-27 <" InFile , |
[Solved]
Hi all,
Thanks! You guys are really great! Noted on the [code] thingy. grail's code is short and sweet and it worked. But there are actually lots of trailing spaces in the rest of the text file that get trimmed. I am not sure how to add the spaces back based on Cedrik's example. I tried commands like Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",sprintf("%-8s",$3))}1'
But I forgot to mention that there are actually other fields after the 5 fields, and there are lots of trailing spaces everywhere. Anyway, Nominal Animal's code is fantastic! Although a tad long, it solves the issue 100%. And thanks for making the effort to explain the code too. Really appreciate it! Sorry Daniel, I didn't try out your code. Cheers!
Quote:
(1) Some problems have more than one solution, and (2) with huge files you may find one of those solutions runs *much* faster than the others. Technical intuition leads me to suspect something which has not been mentioned. The delimiter following the third data field is not a blank, it is a tab character. If this is the case it can work to our advantage. Try this pipe: Code:
" cat <" InFile ,
Daniel B. Martin