Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
| Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
 |
GNU/Linux Basic Guide
This 255-page guide will provide you with the keys to understand the philosophy of free software, teach you how to use and handle it, and give you the tools required to move easily in the world of GNU/Linux. Many users and administrators will be taking their first steps with this GNU/Linux Basic guide and it will show you how to approach and solve the problems you encounter.
Click Here to receive this Complete Guide absolutely free. |
|
 |
01-05-2012, 04:54 AM
|
#1
|
|
LQ Newbie
Registered: Dec 2005
Posts: 3
Rep:
|
SED search and replace fields in a fixed position based on a condition.
Hi,
I am having this super complicated problem that I hope someone will be able to shed some light on.
I have many files (>1MB each) containing millions of records.
Each record have a fixed number of characters (e.g. 50 length), with each field having a fixed position.
Quote:
E.g. 4 records as below:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4
The description of the fields are as follows:
Pos. 1-9 ID
Pos. 11-18 Date
Pos. 20-39 Phone
Pos. 41-46 Time
Pos. 48-50 Checksum
(Somehow the trailing spaces did not get displayed corrected after the 3rd field. Please see this link on how it should look like.)
|
I would like to remove the subsequent string that contains any alphabet in the Phone field. In the example shown above, the 2nd record's '90354B32 ' (with the trailing spaces) will be changed to '90354 ' instead.
I.e. If the 3rd field has not alphabet, then the line should remain intact. If that field contains alphabet, the alphabet and the subsequent characters should be replaced with spaces.
The output should be as follows:
Quote:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354 101011 2A2
233450214 20111231 9035433 101012 2A3
233450215 20111231 90354331 101013 2A4
(Somehow the trailing spaces did not get displayed corrected after the 3rd field. Please see this 2nd link on how it should look like.)
|
I have searched everywhere but I can either find out how to search and replace based on the position only, or search and replace based on matching values. But this is a combination which I can't seems to find any solution at all.
The main issue is that the files must be processed in an efficient way hence i think the best way forward is a combination of 'sed' & 'awk' commands.
I thought the below code should be able to print out the 2 records that contain the fields with alphabet.
Code:
sed "/^(.{20})(.{20})[A-Z]/ p"
But it doesn't seemed to work.
Thanks in advance!
|
|
|
|
01-05-2012, 09:50 AM
|
#2
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,324
|
If you use code instead of quote tags your formatting will remain.
As to the problem, try using awk and work on the third field, something like:
Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3)}1' file
|
|
|
1 members found this post helpful.
|
01-05-2012, 10:10 AM
|
#3
|
|
Senior Member
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140
|
You could add
Code:
$3=sprintf("%-8s", $3);
... after the gsub expression, in grail's code, to right-pad $3
(assuming field's fixed length is 8 chars)
|
|
|
|
01-05-2012, 05:42 PM
|
#4
|
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,581
|
If you have access to a reasonably modern version of gawk, you can also use the FIELDWIDTHS variable to split the line according to fixed column positions.
http://www.gnu.org/software/gawk/man...tant-Size.html
It may be useful if any of the fields themselves could contain whitespace.
|
|
|
|
01-05-2012, 07:58 PM
|
#5
|
|
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
|
How about
Code:
awk -v c=20 -v n=9 'BEGIN { RS = "(\n|\r|\r\n|\n\r)"; FS = "[\n\r]"; RT = "\n";
sp = " ";
while (length(sp) < n) sp = sp sp;
sp = substr(sp, 1, n);
}
{ s = $0;
i = match(substr(s, c, n), /[A-Za-z]/);
if (i > 0) s = substr(s, 1, c+i-2) substr(sp, 1, n-i+1) substr(s, c+n);
printf("%s%s", s, RT);
}' input-file > output-file
I added the semicolons, so you can cram the entire thing on one single line if you want.
On the first line, c defines the first column in the desired field (first column being column 1), and n is the number of characters in the column. If your file contains non-ASCII characters, you need to use a matching locale: define LANG and LC_ALL environment variables accordingly. At least GNU awk will then calculate characters and not bytes.
The BEGIN rule sets the record separator to any newline convention. It will set the field separator to a newline character, so awk will not split the records into fields. GNU awk (gawk) will set RT to the string that matched the record separator for each record; the snippet uses it to retain whatever newlines you use. Since other awk variants do not provide RT, it sets it to UNIX newline, so that they'll work too, just use \n newlines in the output.
The main logic is in the default rule. s is set to the complete record. This is an optimization; if we modified $0 directly, awk would every time see if it needs to be resplit, wasting CPU time. i will contain the index of the first letter within the field, or 0 if the field does not contain letters.
If your input may contain non-ASCII letters, you might wish to use a different pattern, for example /[^0-9]/ to look for any non-digit. Or /[^-+0-9 ]/ to accept digits, space, plus + and minus -, but nothing else. In principle, it is always better to check if the string contains only acceptable characters, rather than to check for unacceptable characters. You can always miss some, after all.
If the field contains a letter, then the entire record is reconstructed. The first substr() retains everything before the current field, and the current field before the match. The second substr() adds the proper number of spaces, and the third retains everything after the field.
Given this input,
Code:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4
the command above will yield
Code:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354 101011 2A2
233450214 20111231 9035433 101012 2A3
233450215 20111231 90354331 101013 2A4
The command does not rely on spaces or field separators, only on c and n .
If you use GNU awk ( gawk), you'll retain the newline convention. Any newline convention is accepted in the input by all awk variants, but other awk variants will convert the newlines to UNIX newlines ("\n") in the output.
Hope this helps,
|
|
|
1 members found this post helpful.
|
01-05-2012, 08:41 PM
|
#6
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Rep: 
|
Warning: I am a newbie. I don't know awk (yet) and always prefer to avoid explicit loops. Here's my proposed solution.
Code:
" cut -c1-27 <" InFile ,
"| sed 's/[A-Z].*/ /g' " ,
"| cut -c1-27" ,
">" Work1
" cut -c29- <" InFile ,
">" Work2
" paste -d' ' " Work1 Work2 ,
">" OutFile
Daniel B. Martin
|
|
|
|
01-06-2012, 12:34 AM
|
#7
|
|
LQ Newbie
Registered: Dec 2005
Posts: 3
Original Poster
Rep:
|
[Solved]
Hi all,
Thanks! You guys are really great!
Noted on the [code] thingy.
grail's code is short and sweet and it worked. But there are actually lots of trailing spaces in the rest of the text file that is trimmed. I am not sure how to append back the spaces based on Cedrik's example.
I tried commands like
Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",sprintf("%-8s",$3))}1'
or
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3=sprintf("%-8s",$3))}1'
and got syntax error.
But I forgot to add that there are actually other fields after the 5 fields and there are lots of trailing spaces everywhere.
Anyway, Nominal Animal's code is fantastic! Although a tat long, it solves the issue 100%. And thanks for making the effort to explain the code too. Really appreciate it!
Sorry Daniel, didn't try out your code.
Cheers!
|
|
|
|
01-06-2012, 06:14 AM
|
#8
|
|
Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 765
Rep: 
|
Quote:
Originally Posted by jfkse7en
Sorry Daniel, didn't try out your code.
|
Gosh, I wish you had. You might discover that
(1) Some problems have more than one solution, and
(2) With huge files you may find one of those solutions runs *much* faster than the others.
Technical intuition leads me to suspect something which has not been mentioned. The delimiter following the third data field is not a blank, it is a tab character. If this is the case it can work to our advantage. Try this pipe:
Code:
" cat <" InFile ,
"| sed 's/[A-Z].*\t/ /' " ,
"| sed 's/\t/ /' "
">" OutFile
You may discover that it has good performance and also handles the data fields and trailing blanks which were not mentioned in your original post.
Daniel B. Martin
Last edited by danielbmartin; 01-06-2012 at 08:02 AM.
Reason: Clarity
|
|
|
|
| Thread Tools |
Search this Thread |
|
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -5. The time now is 09:12 AM.
|
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|