LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   a really tough (at least for me) colum data file missing value problem (https://www.linuxquestions.org/questions/linux-newbie-8/a-really-tough-at-least-for-me-colum-data-file-missing-value-problem-4175441792/)

atjurhs 12-17-2012 10:53 AM

a really tough (at least for me) colum data file missing value problem
 
Good morning guys,

I’m having a really hard time (in fact I’m completely stuck) trying to process a space separated data file because one of the columns is sometimes filled with a space where it should have a 0. The problem happens on the 39th column. Most of the time the 39th column has numbers, but sometimes ( maybe 20% of the time ) where the 39th column should have a 0, it has a space, and on a space separated data file that shifts all the data from the 40th column and on to the left one column, so now all the data that comes after the 39th is shifted over by one column and things are all screwed up.

What I think will help is that 39th column should always contains an integer, and that the integer is always aligned to the right side of the column, plus the 38th and 40th columns always have 6 numbers after it’s decimal and there are always 9 “steps” from the last number in the 38th column to the last number in the 39th column. Or you could also count that there's always 15 “steps” from column 38th’s decimal point to the last number in the 39th column's integer.

Here’s what it looks like, and so you can see the problem easier, I will replace spaces withr dashes, and I’ll write the decimal part of the 38th and 40th columns as “123456”, but you know that’s not how the data file really is, so just for ease of understanding

Code:

      column38  column39          column40      column41
------723.123456-----1321-------9462.123456-----FALSE------etc.
--2384311.123456--------5--------741.123456-----FALSE------etc.
-----3276.123456------268-----194532.123456-----TRUE-------etc.
--4563783.123456-------13-----438378.123456-----FALSE------etc.
------354.123456--------2-------5634.123456-----FALSE------etc.
-------41.123456------------------81.123456-----FALSE------etc.
-----6641.123456---------------67534.123456-----FALSE------etc.
---136671.123456-------67--------675.123456-----FALSE------etc.
-------98.123456-------43-----786344.123456-----FALSE------etc.

so in this example the 6th and 7th rows will mess everything up.

In pseudo-code I think the answer is:
1.Count over to the 38th column
2.Count over 9 “steps” from the last number in the 38th column
3.If the 9th “step” is not a number, make it a 0
4.Else continue to go down the rows looking for the problem

In pseudo-code I think but, idk
step 1 is easy
I have no idea how to do step 2
step 3 is maybe a simple “if” statement ?
and step 4 will happen just because it’s an awk/sed/of bash script, whatever yes?

Thanks for helping me, Tabitha

atjurhs 12-17-2012 11:52 AM

is it maybe some sort of byte counting after the 38th column, but I don't know how to do that

grail 12-17-2012 12:00 PM

I am a little confused by what you want to do?

You have file in the format you mentioned (yes or no)?

What do you wish to do with the data? ie put it another file?

What have you tried in the way of solving your problem, outside of pseudo code?

schneidz 12-17-2012 12:12 PM

Code:

awk 'substr($0,26,1) == " " {print}' test.lst

markush 12-17-2012 12:30 PM

With Perl
Code:

#!/usr/bin/perl

use strict ;
use warnings ;
use feature 'say' ;

while (<STDIN>) {
        my @ar = split /\s+/, $_ ;
        if ( $ar[3] =~ /FALSE|TRUE/ ) {
                splice @ar, 2, 0, "0" ;
                say "@ar";
        }
}

I understood that your file has whitespace as delimiters
Code:

markus@samsung:~/Programmierung/perl$ cat text.txt
      723.123456    1321      9462.123456    FALSE      etc.
  2384311.123456        5        741.123456    FALSE      etc.
    3276.123456      268    194532.123456    TRUE      etc.
  4563783.123456      13    438378.123456    FALSE      etc.
      354.123456        2      5634.123456    FALSE      etc.
      41.123456                  81.123456    FALSE      etc.
    6641.123456              67534.123456    FALSE      etc.
  136671.123456      67        675.123456    FALSE      etc.
      98.123456      43    786344.123456    FALSE      etc.
markus@samsung:~/Programmierung/perl$ ./script.pl <text.txt
 41.123456 0 81.123456 FALSE etc.
 6641.123456 0 67534.123456 FALSE etc.

where the name of the script is script.pl and the inputfile is text.txt.

It searches where FALSE/TRUE is in the wrong column and inserts a 0 in the second column, you will have to change the column numbers.

Here it prints only the changed lines.

Markus

atjurhs 12-17-2012 01:32 PM

yep, the file is white space delimeted.

the fixed file needs to look like this:
Code:

      column38  column39          column40      column41
------723.123456-----1321-------9462.123456-----FALSE------etc.
--2384311.123456--------5--------741.123456-----FALSE------etc.
-----3276.123456------268-----194532.123456-----TRUE-------etc.
--4563783.123456-------13-----438378.123456-----FALSE------etc.
------354.123456--------2-------5634.123456-----FALSE------etc.
-------41.123456--------0---------81.123456-----FALSE------etc.
-----6641.123456--------0------67534.123456-----FALSE------etc.
---136671.123456-------67--------675.123456-----FALSE------etc.
-------98.123456-------43-----786344.123456-----FALSE------etc.

so now rows 6 and 7 won't cause other programs to crash. without a fix (whatever it is) row 6 column 39 would have the value of 81.123456 and column 40 would have FALSE and so on down the row, and the same kind of thing would hapen in row 7, column 39 would have the value 67534.123456 and column 40 would have FALSE and so on down the row

atjurhs 12-17-2012 01:40 PM

markush that might just work??? but I will need it to print out the whole file, not just the fixed rows

couple questions.....

the 3 in ar[3] corresponds to the FALSE|TRUE column because it is 0 based?

the 2 in the line splice @ar, 2, 0, "0" ; means go back 2 columns

the "0" in the line splice @ar, 2, 0, "0" ; means pad with a 0

what does the first 0 mean in that line

thank you sooooo much for your help!!! Tabby

markush 12-17-2012 01:45 PM

The splice command of Perl takes these arguments, the array, the position, the length and the value to insert. the 0 is the length.

You should read
Code:

perldoc -f splice
You can use this code for the whole file, you will have to put the "say" line at the end of the while loop
Code:

#!/usr/bin/perl

use strict ;
use warnings ;
use feature 'say' ;

while (<STDIN>) {
        my @ar = split /\s+/, $_ ;
        if ( $ar[3] =~ /FALSE|TRUE/ ) {
                splice @ar, 2, 0, "0" ;
        }
        say "@ar";
}

Markus

atjurhs 12-17-2012 01:55 PM

sounds great, I'll give it a try.....


thanks again, Tabby

markush 12-17-2012 02:08 PM

For the formatted output you should take a look at Perls write command.
Code:

perldoc -f write
and
Code:

perldoc -f format
and here http://stackoverflow.com/questions/3...tput-with-perl

Markus

atjurhs 12-17-2012 03:56 PM

it gave me the error

Code:

Can't locate feature.pm in @INC  (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at ./script.pl line 5
WOW, that was alot of ugly typing!

ps. trying to help myself, I tried to do a perldoc on 'feature' but it said there is no documentation for perl function 'feature' thenI looked on Google and it said I have to have perl 5.10 changing to 5.10 is probably not going to happen but I'll ask :(

please say it's fixable by another way :)

markush 12-17-2012 04:02 PM

You can substitute
Code:

say "@ar" ;
with
Code:

print "@ar\n" ;
say is the same as print but it adds a newline automatically.

Markus

chrism01 12-17-2012 05:42 PM

Actually, I'd say the Perl unpack fn is ideal for this http://linux.die.net/man/1/perlpacktut

allend 12-18-2012 06:41 AM

Given the text.txt file in post #5, then this sed command (which matches the first 24 characters in a line followed by a space, then changes the space to zero)
Code:

sed 's:\(^.\{24\}\) :\10:g' text.txt
outputs
Code:

      723.123456    1321      9462.123456    FALSE      etc.
  2384311.123456        5        741.123456    FALSE      etc.
    3276.123456      268    194532.123456    TRUE      etc.
  4563783.123456      13    438378.123456    FALSE      etc.
      354.123456        2      5634.123456    FALSE      etc.
      41.123456        0        81.123456    FALSE      etc.
    6641.123456        0      67534.123456    FALSE      etc.
  136671.123456      67        675.123456    FALSE      etc.
      98.123456      43    786344.123456    FALSE      etc.

You could use the -i option to sed to make the changes in the file permanently.
The above assumes that your file has fixed width columns.

grail 12-18-2012 06:47 AM

Well as an alternative:
Code:

ruby -ane '$F.insert(1,0) if $F.length == 4;puts $F.join("\t")' file
Of course you would need to change numbers to reflect your actual data, but the ones here worked with the example.


All times are GMT -5. The time now is 05:44 AM.