LinuxQuestions.org
Old 01-05-2012, 05:54 AM   #1
jfkse7en
LQ Newbie
 
Registered: Dec 2005
Posts: 3

Rep: Reputation: 0
Question SED search and replace fields in a fixed position based on a condition.


Hi,

I am having this super complicated problem that I hope someone will be able to shed some light on.

I have many files (>1MB each) containing millions of records.
Each record has a fixed number of characters (e.g. length 50), and each field has a fixed position.

Quote:
E.g. 4 records as below:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4

The fields are described as follows:
Pos. 1-9 ID
Pos. 11-18 Date
Pos. 20-39 Phone
Pos. 41-46 Time
Pos. 48-50 Checksum
(Somehow the trailing spaces after the 3rd field were not displayed correctly. Please see this link for how it should look.)
I would like to remove the first letter found in the Phone field, together with everything after it in that field. In the example shown above, the 2nd record's '90354B32 ' (with the trailing spaces) will be changed to '90354 ' instead.
I.e. if the 3rd field has no letters, the line should remain intact. If the field contains a letter, that letter and the subsequent characters in the field should be replaced with spaces.

The output should be as follows:
Quote:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354 101011 2A2
233450214 20111231 9035433 101012 2A3
233450215 20111231 90354331 101013 2A4
(Somehow the trailing spaces after the 3rd field were not displayed correctly. Please see this 2nd link for how it should look.)
I have searched everywhere, but I can only find how to search and replace based on position alone, or based on matching values. This is a combination for which I can't seem to find any solution at all.

The main issue is that the files must be processed efficiently, hence I think the best way forward is a combination of 'sed' and 'awk' commands.

I thought the code below would print the 2 records whose Phone field contains a letter.
Code:
sed "/^(.{20})(.{20})[A-Z]/ p"
But it doesn't seem to work.

Thanks in advance!
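
A likely explanation for the sed line above, offered as a sketch rather than a definitive diagnosis: without -E (or -r in older GNU sed) the parentheses and braces are literal characters in a basic regular expression; without -n the p command prints matching lines twice and every other line once; and skipping 40 characters overshoots the Phone field, which starts at column 20. The version below assumes real 50-character records and a sed that supports -E. The function name phone_has_letter is made up for illustration.

```shell
# Hypothetical helper: print only records with a letter in columns 20-39.
# Assumes sed supports -E (GNU and BSD sed do) and records are 50 chars wide.
phone_has_letter() {
    # ^.{19}           skip ID, Date and their separators (columns 1-19)
    # [^A-Za-z]{0,19}  up to 19 non-letters inside the Phone field
    # [A-Za-z]         the first letter, at column 39 or earlier
    sed -n -E '/^.{19}[^A-Za-z]{0,19}[A-Za-z]/p'
}

# Build two sample 50-character records with printf and filter them.
printf '%-9s %-8s %-20s %-6s %-3s\n' \
    233450212 20111230 90354332 101010 2A1 \
    233450213 20111230 90354B32 101011 2A2 | phone_has_letter
```

Only the second record should come out, since the letter in the checksum of the first record sits beyond column 39.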
 
Old 01-05-2012, 10:50 AM   #2
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,693

Rep: Reputation: 1988
If you use code instead of quote tags your formatting will remain.

As to the problem, try using awk and work on the third field, something like:
Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3)}1' file
 
Old 01-05-2012, 11:10 AM   #3
Cedrik
Senior Member
 
Registered: Jul 2004
Distribution: Slackware
Posts: 2,140

Rep: Reputation: 242
You could add
Code:
$3=sprintf("%-8s", $3);
... after the gsub expression, in grail's code, to right-pad $3
(assuming field's fixed length is 8 chars)
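
Putting the two suggestions together might look like the sketch below. The third argument of gsub() has to be something awk can assign to (a variable or a field), so the sprintf() goes in a separate statement after the substitution rather than inside it. The width 8 follows the assumption above; for the 20-character Phone field described in the first post it would be %-20s. The function name pad_phone is made up for illustration. Note that assigning to $3 makes awk rebuild the record with single spaces between fields, so any other runs of spaces in the line are collapsed.

```shell
# Sketch: the awk rule from post #2 with the padding added as a second statement.
# pad_phone is a made-up name; the width 8 is an assumption.
pad_phone() {
    awk '$3 ~ /[A-Z]/ {
             sub(/[A-Z].*/, "", $3)      # drop the first letter and everything after it
             $3 = sprintf("%-8s", $3)    # right-pad the field back to 8 characters
         }
         1'
}

printf '%s\n' '233450213 20111230 90354B32 101011 2A2' | pad_phone
```

For the sample line this leaves 90354 followed by three padding spaces plus the usual single separator before 101011.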
 
Old 01-05-2012, 06:42 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1950
If you have access to a reasonably modern version of gawk, you can also use the FIELDWIDTHS variable to split the line according to fixed column positions.

http://www.gnu.org/software/gawk/man...tant-Size.html

It may be useful if any of the fields themselves could contain whitespace.
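
A minimal sketch of the FIELDWIDTHS approach, assuming gawk is installed and using the 50-character layout from the first post (widths 9, 8, 20, 6 and 3, each separated by one blank column). The function name blank_phone_tail is invented for illustration. Records longer than 50 characters would need further widths added, otherwise the extra columns are lost when a modified record is rebuilt.

```shell
# Sketch only: requires GNU awk, since FIELDWIDTHS is a gawk extension.
blank_phone_tail() {
    gawk 'BEGIN {
              FIELDWIDTHS = "9 1 8 1 20 1 6 1 3"   # ID, gap, Date, gap, Phone, gap, Time, gap, Checksum
              OFS = ""                             # rejoin fields without adding separators
          }
          match($5, /[A-Za-z]/) {
              # keep the part before the first letter, pad back to 20 characters
              $5 = sprintf("%-20s", substr($5, 1, RSTART - 1))
          }
          1'
}
```

It would be used as a filter, e.g. blank_phone_tail < input-file > output-file.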
 
Old 01-05-2012, 08:58 PM   #5
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 943
How about
Code:
awk -v c=20 -v n=9 'BEGIN { RS = "(\n|\r|\r\n|\n\r)"; FS = "[\n\r]"; RT = "\n";
                            sp = "                ";
                            while (length(sp) < n) sp = sp sp;
                            sp = substr(sp, 1, n);
                          }
                    { s = $0;
                      i = match(substr(s, c, n), /[A-Za-z]/);
                      if (i > 0) s = substr(s, 1, c+i-2) substr(sp, 1, n-i+1) substr(s, c+n);
                      printf("%s%s", s, RT);
                    }' input-file > output-file
I added the semicolons, so you can cram the entire thing on one single line if you want.

On the first line, c defines the first column in the desired field (first column being column 1), and n is the number of characters in the column. If your file contains non-ASCII characters, you need to use a matching locale: define LANG and LC_ALL environment variables accordingly. At least GNU awk will then calculate characters and not bytes.

The BEGIN rule sets the record separator to any newline convention. It sets the field separator to a newline character, so awk will not split the records into fields. GNU awk (gawk) will set RT to the string that matched the record separator for each record; the snippet uses it to retain whatever newlines you use. Since other awk variants do not provide RT, the BEGIN rule sets it to a UNIX newline so that they work too; they will just use \n newlines in the output.

The main logic is in the default rule. s is set to the complete record. This is an optimization; if we modified $0 directly, awk would every time see if it needs to be resplit, wasting CPU time. i will contain the index of the first letter within the field, or 0 if the field does not contain letters.

If your input may contain non-ASCII letters, you might wish to use a different pattern, for example /[^0-9]/ to look for any non-digit. Or /[^-+0-9 ]/ to accept digits, space, plus + and minus -, but nothing else. In principle, it is always better to check if the string contains only acceptable characters, rather than to check for unacceptable characters. You can always miss some, after all.

If the field contains a letter, then the entire record is reconstructed. The first substr() retains everything before the current field, and the current field before the match. The second substr() adds the proper number of spaces, and the third retains everything after the field.
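
To make the arithmetic concrete, here is a small trace of the three substr() pieces for the second sample record, using c=20 and n=9 as in the command above. It is a sketch for illustration only: the function name trace_split is made up, and the brackets are added just to make the spaces visible.

```shell
# Trace of the record reconstruction for one sample line (c=20, n=9).
trace_split() {
    awk -v c=20 -v n=9 '{
        s = $0
        i = match(substr(s, c, n), /[A-Za-z]/)     # position of the first letter within the field
        head = substr(s, 1, c + i - 2)             # everything up to the character before the letter
        pad  = sprintf("%" (n - i + 1) "s", "")    # spaces replacing the rest of the field
        tail = substr(s, c + n)                    # everything after the field
        printf "i=%d\n[%s]\n[%s]\n[%s]\n", i, head, pad, tail
    }'
}

printf '%s\n' '233450213 20111230 90354B32 101011 2A2' | trace_split
```

Here i is 6, the head ends at 90354, the pad is four spaces, and the tail starts at 101011.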

Given this input,
Code:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354B32 101011 2A2
233450214 20111231 9035433A 101012 2A3
233450215 20111231 90354331 101013 2A4
the command above will yield
Code:
233450212 20111230 90354332 101010 2A1
233450213 20111230 90354    101011 2A2
233450214 20111231 9035433  101012 2A3
233450215 20111231 90354331 101013 2A4
The command does not rely on spaces or field separators, only on c and n .

If you use GNU awk (gawk), you'll retain the newline convention. Any newline convention is accepted in the input by all awk variants, but other awk variants will convert the newlines to UNIX newlines ("\n") in the output.

Hope this helps,
 
Old 01-05-2012, 09:41 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,165

Rep: Reputation: 306
Warning: I am a newbie. I don't know awk (yet) and always prefer to avoid explicit loops. Here's my proposed solution.

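Code:
  # InFile, Work1, Work2 and OutFile are file names.
  # Columns 1-27: blank out the first capital letter and everything after it,
  # then trim back to 27 columns.
  cut -c1-27 < InFile |
    sed 's/[A-Z].*/        /' |
    cut -c1-27 > Work1

  # Columns 29 onward are kept unchanged.
  cut -c29- < InFile > Work2

  # Rejoin the two parts with the single blank from column 28.
  paste -d' ' Work1 Work2 > OutFile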
Daniel B. Martin
 
Old 01-06-2012, 01:34 AM   #7
jfkse7en
LQ Newbie
 
Registered: Dec 2005
Posts: 3

Original Poster
Rep: Reputation: 0
Thumbs up [Solved]

Hi all,

Thanks! You guys are really great!

Noted on the [code] thingy.

grail's code is short and sweet and it worked. But there are actually lots of trailing spaces in the rest of the text file that get trimmed. I am not sure how to add the spaces back based on Cedrik's example.
I tried commands like
Code:
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",sprintf("%-8s",$3))}1'
or
awk '$3 ~ /[A-Z]/{gsub(/[A-Z].*/,"",$3=sprintf("%-8s",$3))}1'
and got a syntax error.

But I forgot to add that there are actually other fields after the 5 fields and there are lots of trailing spaces everywhere.

Anyway, Nominal Animal's code is fantastic! Although a tad long, it solves the issue 100%. And thanks for making the effort to explain the code too. Really appreciate it!

Sorry Daniel, didn't try out your code.

Cheers!
 
Old 01-06-2012, 07:14 AM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,165

Rep: Reputation: 306
Quote:
Originally Posted by jfkse7en
Sorry Daniel, didn't try out your code.
Gosh, I wish you had. You might discover that
(1) Some problems have more than one solution, and
(2) With huge files you may find one of those solutions runs *much* faster than the others.

Technical intuition leads me to suspect something which has not been mentioned. The delimiter following the third data field is not a blank, it is a tab character. If this is the case it can work to our advantage. Try this pipe:
Code:
  # Assumes GNU sed, which understands \t as a tab character.
  sed 's/[A-Z].*\t/ /' < InFile |
    sed 's/\t/ /' > OutFile
You may discover that it has good performance and also handles the data fields and trailing blanks which were not mentioned in your original post.

Daniel B. Martin

Last edited by danielbmartin; 01-06-2012 at 09:02 AM. Reason: Clarity
 
  



Tags
condition, fixed, position, replace, search

