LinuxQuestions.org
Old 12-17-2012, 10:53 AM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Rep: Reputation: Disabled
a really tough (at least for me) column data file missing value problem


Good morning guys,

I’m having a really hard time (in fact I’m completely stuck) trying to process a space-separated data file, because one of the columns is sometimes left blank where it should hold a 0. The problem is in the 39th column. Most of the time the 39th column has a number, but sometimes (maybe 20% of the time), where it should have a 0, it has only spaces. In a space-separated file that means everything from the 40th column onward gets read one column to the left, so all the data after the 39th column is off by one and things are all screwed up.

What I think will help is that the 39th column should always contain an integer, and that integer is always aligned to the right side of the column. Also, the 38th and 40th columns always have 6 digits after their decimal point, and there are always 9 “steps” from the last digit in the 38th column to the last digit in the 39th column. Or you could count 15 “steps” from the 38th column’s decimal point to the last digit of the 39th column's integer.

Here’s what it looks like. So you can see the problem more easily, I’ve replaced spaces with dashes and written the decimal part of the 38th and 40th columns as “123456”. That’s not how the data file really is; it’s just for ease of understanding.

Code:
       column38   column39          column40       column41
------723.123456-----1321-------9462.123456-----FALSE------etc.
--2384311.123456--------5--------741.123456-----FALSE------etc.
-----3276.123456------268-----194532.123456-----TRUE-------etc.
--4563783.123456-------13-----438378.123456-----FALSE------etc.
------354.123456--------2-------5634.123456-----FALSE------etc.
-------41.123456------------------81.123456-----FALSE------etc.
-----6641.123456---------------67534.123456-----FALSE------etc.
---136671.123456-------67--------675.123456-----FALSE------etc.
-------98.123456-------43-----786344.123456-----FALSE------etc.

So in this example the 6th and 7th rows will mess everything up.
In pseudo-code I think the answer is:
1. Count over to the 38th column
2. Count over 9 “steps” from the last number in the 38th column
3. If the 9th “step” is not a number, make it a 0
4. Else continue down the rows looking for the problem

That's my thinking in pseudo-code, but I don't know how to code it:
step 1 is easy
I have no idea how to do step 2
step 3 is maybe a simple “if” statement?
and step 4 will happen just because it’s an awk/sed/bash script or whatever, yes?

Thanks for helping me, Tabitha
 
Old 12-17-2012, 11:52 AM   #2
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
Is it maybe some sort of byte counting after the 38th column? I don't know how to do that, though.
 
Old 12-17-2012, 12:00 PM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
I am a little confused about what you want to do.

Do you have a file in the format you mentioned (yes or no)?

What do you wish to do with the data, i.e. put it in another file?

What have you tried in the way of solving your problem, outside of pseudo-code?
 
Old 12-17-2012, 12:12 PM   #4
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 5,120

Rep: Reputation: 876
Code:
awk 'substr($0,25,1) == " " {print}' test.lst
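
The one-liner above only prints rows that have a blank at a fixed position. A minimal sketch that also does the repair, following the four steps from post #1, might look like the function below. The name `pad_blank_after_field` and its two arguments are made up for illustration, and it assumes the file uses plain spaces (no tabs) and that the gap is always 9 characters. On the real file the field number would be 38; with the shortened sample it is 1.

```shell
# Hypothetical helper (not from the thread): pad_blank_after_field FIELD STEPS
# Reads lines on stdin. For each line, finds where field number FIELD ends,
# steps STEPS characters to the right, and writes a 0 there if that
# character is a blank. Every line is printed, changed or not.
# Assumes fields are separated by spaces only.
pad_blank_after_field() {
    awk -v field="$1" -v steps="$2" '
    {
        line = $0
        pos = 0                      # end position of the last field scanned
        for (i = 1; i <= field; i++) {
            rest = substr(line, pos + 1)
            # skip the blanks before field i, then the field itself
            if (match(rest, /^ *[^ ]+/) == 0) { pos = 0; break }
            pos += RLENGTH
        }
        target = pos + steps         # where the last digit of the next column belongs
        if (pos > 0 && target <= length(line) && substr(line, target, 1) == " ")
            line = substr(line, 1, target - 1) "0" substr(line, target + 1)
        print line
    }'
}
```

It could be run as `pad_blank_after_field 38 9 < data.txt > data.fixed.txt` (both file names are placeholders). Because it only overwrites one character, the column alignment of the file is left as it was.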
 
Old 12-17-2012, 12:30 PM   #5
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
With Perl
Code:
#!/usr/bin/perl

use strict ;
use warnings ;
use feature 'say' ;

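while (<STDIN>) {
        # the lines start with blanks, so split returns an empty first element
        my @ar = split /\s+/, $_ ;
        # if TRUE/FALSE sits at index 3, the integer column was blank
        if ( $ar[3] =~ /FALSE|TRUE/ ) {
                # insert "0" at index 2, removing nothing
                splice @ar, 2, 0, "0" ;
                say "@ar";
        }
}
I understood that your file has whitespace as delimiters: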
Code:
markus@samsung:~/Programmierung/perl$ cat text.txt
      723.123456     1321       9462.123456     FALSE      etc.
  2384311.123456        5        741.123456     FALSE      etc.
     3276.123456      268     194532.123456     TRUE       etc.
  4563783.123456       13     438378.123456     FALSE      etc.
      354.123456        2       5634.123456     FALSE      etc.
       41.123456                  81.123456     FALSE      etc.
     6641.123456               67534.123456     FALSE      etc.
   136671.123456       67        675.123456     FALSE      etc.
       98.123456       43     786344.123456     FALSE      etc.
markus@samsung:~/Programmierung/perl$ ./script.pl <text.txt
 41.123456 0 81.123456 FALSE etc.
 6641.123456 0 67534.123456 FALSE etc.
where the name of the script is script.pl and the input file is text.txt.

It looks for rows where FALSE/TRUE has ended up in the wrong column and inserts a 0 as the second column. You will have to change the column numbers for your real file.

Here it prints only the changed lines.

Markus

Last edited by markush; 12-17-2012 at 12:31 PM.
 
Old 12-17-2012, 01:32 PM   #6
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
Yep, the file is whitespace delimited.

the fixed file needs to look like this:
Code:
       column38   column39          column40       column41
------723.123456-----1321-------9462.123456-----FALSE------etc.
--2384311.123456--------5--------741.123456-----FALSE------etc.
-----3276.123456------268-----194532.123456-----TRUE-------etc.
--4563783.123456-------13-----438378.123456-----FALSE------etc.
------354.123456--------2-------5634.123456-----FALSE------etc.
-------41.123456--------0---------81.123456-----FALSE------etc.
-----6641.123456--------0------67534.123456-----FALSE------etc.
---136671.123456-------67--------675.123456-----FALSE------etc.
-------98.123456-------43-----786344.123456-----FALSE------etc.
So now rows 6 and 7 won't cause other programs to crash. Without a fix (whatever it is), row 6 column 39 would have the value 81.123456, column 40 would have FALSE, and so on down the row. The same kind of thing would happen in row 7: column 39 would have the value 67534.123456, column 40 would have FALSE, and so on down the row.

Last edited by atjurhs; 12-17-2012 at 01:34 PM.
 
Old 12-17-2012, 01:40 PM   #7
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
markush, that might just work! But I will need it to print out the whole file, not just the fixed rows.

A couple of questions:

Does the 3 in ar[3] correspond to the FALSE|TRUE column because it is 0-based?

Does the 2 in the line splice @ar, 2, 0, "0" ; mean go back 2 columns?

Does the "0" in the line splice @ar, 2, 0, "0" ; mean pad with a 0?

What does the first 0 mean in that line?

thank you sooooo much for your help!!! Tabby
 
Old 12-17-2012, 01:45 PM   #8
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
Perl's splice function takes these arguments: the array, the position, the length (the number of elements to remove) and the value to insert. The first 0 is the length, so nothing is removed.

You should read
Code:
perldoc -f splice
You can use this code for the whole file; you just have to move the "say" line to the end of the while loop:
Code:
#!/usr/bin/perl

use strict ;
use warnings ;
use feature 'say' ;

while (<STDIN>) {
        my @ar = split /\s+/, $_ ;
        if ( $ar[3] =~ /FALSE|TRUE/ ) {
                splice @ar, 2, 0, "0" ;
        }
        say "@ar";
}
Markus

Last edited by markush; 12-17-2012 at 04:02 PM. Reason: typo
 
Old 12-17-2012, 01:55 PM   #9
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
sounds great, I'll give it a try.....


thanks again, Tabby
 
Old 12-17-2012, 02:08 PM   #10
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
For the formatted output you should take a look at Perl's write command.
Code:
perldoc -f write
and
Code:
perldoc -f format
and here http://stackoverflow.com/questions/3...tput-with-perl

Markus
 
Old 12-17-2012, 03:56 PM   #11
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
it gave me the error

Code:
Can't locate feature.pm in @INC  (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at ./script.pl line 5
WOW, that was a lot of ugly typing!

P.S. Trying to help myself, I tried to do a perldoc on 'feature', but it said there is no documentation for perl function 'feature'. Then I looked on Google and it said I have to have Perl 5.10. Changing to 5.10 is probably not going to happen, but I'll ask.

Please say it's fixable another way.

Last edited by atjurhs; 12-17-2012 at 04:03 PM.
 
Old 12-17-2012, 04:02 PM   #12
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: 850
You can substitute
Code:
say "@ar" ;
with
Code:
print "@ar\n" ;
say is the same as print, except that it adds a newline automatically.

Markus
 
Old 12-17-2012, 05:42 PM   #13
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,362

Rep: Reputation: 2377
Actually, I'd say the Perl unpack function is ideal for this: http://linux.die.net/man/1/perlpacktut
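
The point of unpack here is that it cuts each line by fixed widths instead of by whitespace, so a blank slot is still a field. As a rough sketch of the same idea without Perl, the awk function below slices with substr. The name `slice_fixed` is made up, the widths 16, 9 and 18 are read off the sample in post #5, and treating every all-blank slice as 0 is an assumption that suits column 39 but may not suit other columns.

```shell
# Hypothetical helper (not from the thread): slice_fixed WIDTHS
# WIDTHS is a space-separated list of column widths, e.g. "16 9 18".
# Each input line is cut into slices of those widths; blank slices become 0,
# other slices lose their padding, and the result is printed tab-separated.
# Anything beyond the listed widths is ignored in this sketch.
slice_fixed() {
    awk -v widths="$1" '
    BEGIN { n = split(widths, w, " ") }
    {
        start = 1
        out = ""
        for (i = 1; i <= n; i++) {
            cell = substr($0, start, w[i])
            gsub(/^ +| +$/, "", cell)    # trim the padding around the value
            if (cell == "") cell = "0"   # an all-blank slice is a missing value
            out = out (i > 1 ? "\t" : "") cell
            start += w[i]
        }
        print out
    }'
}
```

For the sample, `slice_fixed "16 9 18" < text.txt` would give three tab-separated values per row. For the real file the width list would have to cover all the columns that are needed.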
 
Old 12-18-2012, 06:41 AM   #14
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,552

Rep: Reputation: 1433
Given the text.txt file in post #5, then this sed command (which matches the first 24 characters in a line followed by a space, then changes the space to zero)
Code:
sed 's:\(^.\{24\}\) :\10:g' text.txt
outputs
Code:
      723.123456     1321       9462.123456     FALSE      etc.
  2384311.123456        5        741.123456     FALSE      etc.
     3276.123456      268     194532.123456     TRUE       etc.
  4563783.123456       13     438378.123456     FALSE      etc.
      354.123456        2       5634.123456     FALSE      etc.
       41.123456        0         81.123456     FALSE      etc.
     6641.123456        0      67534.123456     FALSE      etc.
   136671.123456       67        675.123456     FALSE      etc.
       98.123456       43     786344.123456     FALSE      etc.
You could use the -i option to sed to make the changes in the file permanently.
The above assumes that your file has fixed width columns.
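
Since -i overwrites the file, it may be worth checking first which lines would change. A small sketch, assuming the same fixed-width layout: the function below (the name `preview_pad` is made up) builds the same substitution for a given character position and, using sed's -n option and the p flag, prints only the lines that would be altered.

```shell
# Hypothetical helper (not from the thread): preview_pad POSITION [FILE...]
# Prints only the lines sed would change, i.e. those with a space at
# POSITION, showing them with the space already replaced by 0.
# With no FILE argument it reads stdin.
preview_pad() {
    pos=$1
    shift
    # the first POSITION-1 characters are kept as group 1, then the space becomes 0
    sed -n "s/^\(.\{$((pos - 1))\}\) /\10/p" "$@"
}
```

For the sample file this would be `preview_pad 25 text.txt`. If the output looks right, the change can then be applied with the sed command above; using `-i.bak` instead of `-i` keeps a backup copy of the original.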
 
Old 12-18-2012, 06:47 AM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
Well as an alternative:
Code:
ruby -ane '$F.insert(1,0) if $F.length == 4;puts $F.join("\t")' file
Of course you would need to change numbers to reflect your actual data, but the ones here worked with the example.
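
For a machine without Ruby, the same field-count test can be sketched in awk. The function name `insert_zero_field` and its arguments are made up for illustration. It assumes at most one value is missing per row and that the missing one is always the same column. Like the Ruby and Perl versions, it rewrites the separators rather than keeping the original alignment.

```shell
# Hypothetical helper (not from the thread): insert_zero_field COL EXPECTED
# If a line has exactly EXPECTED-1 whitespace-separated fields, a 0 is
# inserted as field number COL. All lines are printed tab-separated.
insert_zero_field() {
    awk -v col="$1" -v expected="$2" '
    BEGIN { OFS = "\t" }
    {
        if (NF == expected - 1) {
            for (i = NF; i >= col; i--) $(i + 1) = $i   # shift fields right
            $col = 0
        } else {
            $1 = $1          # force awk to rebuild the line with OFS
        }
        print
    }'
}
```

With the five-column sample it would be called as `insert_zero_field 2 5 < file`. For the real data the column would be 39, and the expected count would be the number of fields in a known-good row.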
 
  


All times are GMT -5. The time now is 10:19 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration