LinuxQuestions.org - a really tough (at least for me) colum data file missing value problem

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - a really tough (at least for me) colum data file missing value problem (https://www.linuxquestions.org/questions/linux-newbie-8/a-really-tough-at-least-for-me-colum-data-file-missing-value-problem-4175441792/)

a really tough (at least for me) colum data file missing value problem

Good morning guys,

I’m having a really hard time (in fact I’m completely stuck) trying to process a space separated data file because one of the columns is sometimes filled with a space where it should have a 0. The problem happens on the 39th column. Most of the time the 39th column has numbers, but sometimes ( maybe 20% of the time ) where the 39th column should have a 0, it has a space, and on a space separated data file that shifts all the data from the 40th column and on to the left one column, so now all the data that comes after the 39th is shifted over by one column and things are all screwed up.

What I think will help is that 39th column should always contains an integer, and that the integer is always aligned to the right side of the column, plus the 38th and 40th columns always have 6 numbers after it’s decimal and there are always 9 “steps” from the last number in the 38th column to the last number in the 39th column. Or you could also count that there's always 15 “steps” from column 38th’s decimal point to the last number in the 39th column's integer.

Here’s what it looks like, and so you can see the problem easier, I will replace spaces withr dashes, and I’ll write the decimal part of the 38th and 40th columns as “123456”, but you know that’s not how the data file really is, so just for ease of understanding

Code:

      column38  column39          column40      column41

------723.123456-----1321-------9462.123456-----FALSE------etc.

--2384311.123456--------5--------741.123456-----FALSE------etc.

-----3276.123456------268-----194532.123456-----TRUE-------etc.

--4563783.123456-------13-----438378.123456-----FALSE------etc.

------354.123456--------2-------5634.123456-----FALSE------etc.

-------41.123456------------------81.123456-----FALSE------etc.

-----6641.123456---------------67534.123456-----FALSE------etc.

---136671.123456-------67--------675.123456-----FALSE------etc.

-------98.123456-------43-----786344.123456-----FALSE------etc.



so in this example the 6th and 7th rows will mess everything up.

In pseudo-code I think the answer is:
1.Count over to the 38th column
2.Count over 9 “steps” from the last number in the 38th column
3.If the 9th “step” is not a number, make it a 0
4.Else continue to go down the rows looking for the problem

In pseudo-code I think but, idk
step 1 is easy
I have no idea how to do step 2
step 3 is maybe a simple “if” statement ?
and step 4 will happen just because it’s an awk/sed/of bash script, whatever yes?

Thanks for helping me, Tabitha

is it maybe some sort of byte counting after the 38th column, but I don't know how to do that

I am a little confused by what you want to do?

You have file in the format you mentioned (yes or no)?

What do you wish to do with the data? ie put it another file?

What have you tried in the way of solving your problem, outside of pseudo code?

Code:

awk 'substr($0,26,1) == " " {print}' test.lst

With Perl

Code:

#!/usr/bin/perl



use strict ;

use warnings ;

use feature 'say' ;



while (<STDIN>) {

        my @ar = split /\s+/, $_ ;

        if ( $ar[3] =~ /FALSE|TRUE/ ) {

                splice @ar, 2, 0, "0" ;

                say "@ar";

        }

}

I understood that your file has whitespace as delimiters

Code:

markus@samsung:~/Programmierung/perl$ cat text.txt

      723.123456    1321      9462.123456    FALSE      etc.

  2384311.123456        5        741.123456    FALSE      etc.

    3276.123456      268    194532.123456    TRUE      etc.

  4563783.123456      13    438378.123456    FALSE      etc.

      354.123456        2      5634.123456    FALSE      etc.

      41.123456                  81.123456    FALSE      etc.

    6641.123456              67534.123456    FALSE      etc.

  136671.123456      67        675.123456    FALSE      etc.

      98.123456      43    786344.123456    FALSE      etc.

markus@samsung:~/Programmierung/perl$ ./script.pl <text.txt

 41.123456 0 81.123456 FALSE etc.

 6641.123456 0 67534.123456 FALSE etc.

where the name of the script is script.pl and the inputfile is text.txt.

It searches where FALSE/TRUE is in the wrong column and inserts a 0 in the second column, you will have to change the column numbers.

Here it prints only the changed lines.

Markus

yep, the file is white space delimeted.

the fixed file needs to look like this:

Code:

      column38  column39          column40      column41

------723.123456-----1321-------9462.123456-----FALSE------etc.

--2384311.123456--------5--------741.123456-----FALSE------etc.

-----3276.123456------268-----194532.123456-----TRUE-------etc.

--4563783.123456-------13-----438378.123456-----FALSE------etc.

------354.123456--------2-------5634.123456-----FALSE------etc.

-------41.123456--------0---------81.123456-----FALSE------etc.

-----6641.123456--------0------67534.123456-----FALSE------etc.

---136671.123456-------67--------675.123456-----FALSE------etc.

-------98.123456-------43-----786344.123456-----FALSE------etc.

so now rows 6 and 7 won't cause other programs to crash. without a fix (whatever it is) row 6 column 39 would have the value of 81.123456 and column 40 would have FALSE and so on down the row, and the same kind of thing would hapen in row 7, column 39 would have the value 67534.123456 and column 40 would have FALSE and so on down the row

markush that might just work??? but I will need it to print out the whole file, not just the fixed rows

couple questions.....

the 3 in ar[3] corresponds to the FALSE|TRUE column because it is 0 based?

the 2 in the line splice @ar, 2, 0, "0" ; means go back 2 columns

the "0" in the line splice @ar, 2, 0, "0" ; means pad with a 0

what does the first 0 mean in that line

thank you sooooo much for your help!!! Tabby

The splice command of Perl takes these arguments, the array, the position, the length and the value to insert. the 0 is the length.

You should read

Code:

perldoc -f splice

You can use this code for the whole file, you will have to put the "say" line at the end of the while loop

Code:

#!/usr/bin/perl



use strict ;

use warnings ;

use feature 'say' ;



while (<STDIN>) {

        my @ar = split /\s+/, $_ ;

        if ( $ar[3] =~ /FALSE|TRUE/ ) {

                splice @ar, 2, 0, "0" ;

        }

        say "@ar";

}

Markus

sounds great, I'll give it a try.....

thanks again, Tabby

For the formatted output you should take a look at Perls write command.

Code:

perldoc -f write

and

Code:

perldoc -f format

and here http://stackoverflow.com/questions/3...tput-with-perl

Markus

it gave me the error

Code:

Can't locate feature.pm in @INC  (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi /usr/lib/perl5/5.8.8 .) at ./script.pl line 5

WOW, that was alot of ugly typing!

ps. trying to help myself, I tried to do a perldoc on 'feature' but it said there is no documentation for perl function 'feature' thenI looked on Google and it said I have to have perl 5.10 changing to 5.10 is probably not going to happen but I'll ask :(

please say it's fixable by another way :)

You can substitute

Code:

say "@ar" ;

with

Code:

print "@ar\n" ;

say is the same as print but it adds a newline automatically.

Markus

Actually, I'd say the Perl unpack fn is ideal for this http://linux.die.net/man/1/perlpacktut

Given the text.txt file in post #5, then this sed command (which matches the first 24 characters in a line followed by a space, then changes the space to zero)

Code:

sed 's:$^.\{24\}$ :\10:g' text.txt

outputs

Code:

      723.123456    1321      9462.123456    FALSE      etc.

  2384311.123456        5        741.123456    FALSE      etc.

    3276.123456      268    194532.123456    TRUE      etc.

  4563783.123456      13    438378.123456    FALSE      etc.

      354.123456        2      5634.123456    FALSE      etc.

      41.123456        0        81.123456    FALSE      etc.

    6641.123456        0      67534.123456    FALSE      etc.

  136671.123456      67        675.123456    FALSE      etc.

      98.123456      43    786344.123456    FALSE      etc.

You could use the -i option to sed to make the changes in the file permanently.
The above assumes that your file has fixed width columns.

Well as an alternative:

Code:

ruby -ane '$F.insert(1,0) if $F.length == 4;puts $F.join("\t")' file

Of course you would need to change numbers to reflect your actual data, but the ones here worked with the example.