LinuxQuestions.org

-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   reformatting a giant matrix (https://www.linuxquestions.org/questions/linux-newbie-8/reformatting-a-giant-matrix-4175467385/)

atjurhs 06-25-2013 05:11 PM

reformatting a giant matrix
 
Hi boys,

I have a file.txt that has 22800 rows of data by 5 columns across

the first thing I do with it is split it up into six-line pieces with

Code:

split -d -l 6 file.txt
and that gets me 3800 files that are each 6-row by 5-column matrices

then I put them back together (reordered) four at a time with

Code:

paste x001 x002 x003 x004 > six_by_four_001.txt
then
Code:

paste x005 x006 x007 x008 > six_by_four_002.txt
and
Code:

paste x009 x010 x011 x012 > six_by_four_003.txt
and
Code:

paste x013 x014 x015 x016 > six_by_four_004.txt
and
Code:

paste x017 x018 x019 x020 > six_by_four_005.txt
and
Code:

paste x021 x022 x023 x024 > six_by_four_006.txt
and
Code:

paste x025 x026 x027 x028 > six_by_four_007.txt
and
Code:

paste x029 x030 x031 x032 > six_by_four_008.txt
and
Code:

paste x033 x034 x035 x036 > six_by_four_009.txt
and
Code:

paste x037 x038 x039 x040 > six_by_four_010.txt
this gets me ten files that each have 6 rows and 20 columns, then I cat those together with

Code:

cat six_by_four_001.txt six_by_four_002.txt six_by_four_003.txt six_by_four_004.txt six_by_four_005.txt six_by_four_006.txt six_by_four_007.txt six_by_four_008.txt six_by_four_009.txt six_by_four_010.txt > sixty_by_twenty.txt
as you can see this is a lot of "by hand" and I haven't gotten through all the 3800 files, yikes!!!!

I'd really like some help automating it......

Thanks so much Tabitha!

evo2 06-25-2013 05:15 PM

Hi girls,

can you tell us what you are actually trying to achieve? With that information we should be able to suggest an efficient approach.

Evo2.

atjurhs 06-25-2013 05:27 PM

I wish that were easier to say :( I'll try to describe in general:

I have this giant data file with all the data in a single matrix format and I need to have the data re-organized into a different format. The format I need has the elements in different locations and the giant file broken into multiple smaller files based on other criteria.

sorry it's really hard to explain, but I know the steps I have so far are correct, just a lot of typing, and then I'll have to do it again and again :( that's why I came asking for help.

Tabby

evo2 06-25-2013 05:44 PM

Hi

It would probably be easiest to do this in a language like Python, but it could also be automated using the tools you are already using, e.g. a quick and dirty script something like:
Code:

#!/bin/bash
# split the 22800-line file into 3800 six-line pieces named x0000 .. x3799
split -d -a 4 -l 6 file.txt
# 3800 pieces pasted four at a time gives 950 six-by-twenty files
for i in {0..949} ; do
  a=$(printf 'x%04d' $((i*4)) )
  b=$(printf 'x%04d' $((i*4+1)) )
  c=$(printf 'x%04d' $((i*4+2)) )
  d=$(printf 'x%04d' $((i*4+3)) )
  out=$(printf 'six_by_four_%04d.txt' $i)
  paste "$a" "$b" "$c" "$d" > "$out"
done
cat six_by_four_*.txt > sixty_by_twenty.txt
\rm six_by_four_*.txt

Evo2.

PS. I'm having to guess here that this is what you actually want, since what you have presented is incomplete...

chrism01 06-25-2013 06:12 PM

Given the size of data and the manipulations involved, definitely write a program as mentioned above; my vote goes to Perl.
This would be able to do all you want in one program/pass and do it very quickly.
In case you wondered, Perl is compiled on the fly, not interpreted like e.g. bash.
It has lots of modules you may find useful and an 'extension' called PDL (the Perl Data Language), written especially for this kind of problem.
http://perldoc.perl.org/

atjurhs 06-26-2013 11:24 AM

good morning guys!

I kinda misunderstood what I am supposed to do, super sorry!

so what I am supposed to do is to take the 22800_by_5_matrix.txt file and do a bunch of chopping and a bunch of zero padding, and stick parts of it back together. Here's what I've done so far and I'm told the output is right for one "d_block_1" of data. I'll have to create 3800 of these.

create a text file that has a matrix of 6 rows by 15 columns of zeros; I'll call this 6_by_15_block_of_zeros.txt and this file will get used many, many times
create another text file that has a matrix of 15 rows by 20 columns of zeros; I'll call this 15_by_20_block_of_zeros.txt and this file will also get used many, many times

Code:


split -d -a 4 -l 6 22800_by_5_file.txt      # this creates 3800 files named x0000, x0001, x0002 ... x3799 that are 6_by_5 matrices
paste x0000 6_by_15_block_of_zeros.txt > 6_by_20_padded_block_of_data.txt  # I think this output is a tmp file which can be overwritten each time this gets looped thru
cat 6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > d_block_00.dat

this creates one 21_by_20 matrix of data for the x0000 file with the upper left 6_by_5 elements having data and the rest of the elements being zeros, so now I need to loop over this 3800 times.
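
If it helps, those two zero-block files don't have to be typed in by hand; a quick sketch (not from the thread, just one way to do it) that should produce them with the filenames described above:
Code:

# each printf prints one row of space-separated zeros; the loop writes the rows to the file
for r in {1..6};  do printf '0 %.0s' {1..15}; echo; done > 6_by_15_block_of_zeros.txt
for r in {1..15}; do printf '0 %.0s' {1..20}; echo; done > 15_by_20_block_of_zeros.txt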

using evo2's little piece of code I came up with (and I know this isn't right, it doesn't run), but I'm trying...
Code:

#!/bin/bash
split -d -l 6 22800_by_5_matrix.txt
for i in {0..3799} ; do
a=$(printf 'x%04d' $(i+1))
out=$(printf 'd_block_%04d.txt' $i)
paste $a 6_by_15_block_of_zeros.txt > $6_by_20_padded_block_of_data.txt
cat $6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > $out
done
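
For what it's worth, a cleaned-up sketch of the loop being attempted here might look like the following; it assumes split was run with -a 4 so the pieces are named x0000 .. x3799, and it reuses a single temp file as described above:
Code:

#!/bin/bash
split -d -a 4 -l 6 22800_by_5_matrix.txt
for i in {0..3799} ; do
  a=$(printf 'x%04d' $i)
  out=$(printf 'd_block_%04d.dat' $i)
  # 6x5 data block plus 15 zero columns -> 6x20, written to a reusable temp file
  paste "$a" 6_by_15_block_of_zeros.txt > 6_by_20_padded_block_of_data.txt
  # stack 15 zero rows underneath -> 21x20
  cat 6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > "$out"
done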

thanks guys so much for your help!!!

Tabby

grail 06-26-2013 11:58 AM

hmmm ... well not sure how fast it will run or even if I have the whole picture yet, but maybe something like:
Code:

#!/usr/bin/awk -f

BEGIN{
        end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

        cnt = 0
        file_name = "d_block_"

        end = sprintf("%04d.dat",cnt++)
}

{ print $0,end_zeroes > file_name end }

!(NR % 6){

        for(i = 0; i < 15; i++)
                print extra_zeroes > file_name end

        end = sprintf("%04d.dat",cnt++)
}

And you would run it like:
Code:

./script.awk 22800_by_5_matrix.txt

atjurhs 06-26-2013 02:29 PM

aaaah, you've come to my rescue again :)

after running your script (that works perfectly btw) I then execute

Code:


cat d_block_* > one_big_padded_file.dat
split -d -a 3 -l 210 one_big_padded_file.dat

and I get out 380 files named

Code:

x000 x001 x002... x379
perfect :) and each of them has 210 rows by 20 columns of data per file, double perfect :))

the last thing I have to do is rename the files. I think I can do this with mv and a for loop, but

I need the 1st group of ten files to be called: data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
and then the 2nd group of ten files needs to be called: data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
etc.
etc.
etc.
all the way to the last group of ten files called: data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat

so I know the only way to do this and not overwrite files is to put them in separate folders. So I have folders f1 f2 f3... f38 and then each of the folders gets a group of ten files moved into it.

so the two tasks I have left are

1) moving groups of 10 files into their proper folder
2) batch renaming the files with incrementing a part of the name


so that's what I'm going to work on, and I'll let you know when I get stuck :(

thanks so much, Tabby
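
For those two remaining tasks, a hedged sketch, assuming the split output really is x000 .. x379 and that the folders f1 .. f38 may not exist yet:
Code:

#!/bin/bash
# sketch only: move each run of ten x-files into its own folder,
# renaming them data_13.dat, data_16.dat, ... data_40.dat in steps of 3
for g in {0..37} ; do
  dir="f$((g+1))"
  mkdir -p "$dir"
  for j in {0..9} ; do
    src=$(printf 'x%03d' $((g*10+j)))
    dst=$(printf 'data_%d.dat' $((13+j*3)))
    mv "$src" "$dir/$dst"
  done
done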

AnanthaP 06-27-2013 03:24 AM

In awk (pseudo code)
In the BEGIN block, define an array of 6 x 20
In the regular block,
1. LET r = (line_no mod 24)
2. If r = 1 then initialize the array, i.e. on the 1st, 25th, 49th etc. record
3. If r is 1 to 6, write $1 to $5 of the row to array elements (r,1) to (r,5)
4. If r is 7 to 12, write $1 to $5 of the row to array elements (r-6,6) to (r-6,10)
5. If r is 13 to 18, write $1 to $5 of the row to array elements (r-12,11) to (r-12,15)
6. If r is 19 to 23, or r = 0 (treat it as 24), write $1 to $5 of the row to array elements (r-18,16) to (r-18,20)
7. When the array is full, i.e. when line_no mod 24 = 0, write the array out

This should do it.

OK
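
Roughly, that pseudo code might translate into awk along these lines; this is only a sketch, and it rebuilds the 6-row by 20-column layout from the first post, which may or may not still be the layout needed:
Code:

#!/usr/bin/awk -f
# buffer 24 input rows (four 6x5 blocks), lay them side by side
# as 6 rows x 20 columns, then print the block and start over
{
        r = (NR - 1) % 24          # 0..23 within the current 24-row group
        row = r % 6 + 1            # target row 1..6
        blk = int(r / 6)           # which 5-column slice: 0..3
        for (j = 1; j <= 5; j++)
                a[row, blk * 5 + j] = $j
}

!(NR % 24) {
        for (i = 1; i <= 6; i++) {
                line = a[i, 1]
                for (j = 2; j <= 20; j++)
                        line = line " " a[i, j]
                print line
        }
}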

grail 06-27-2013 07:38 AM

I am with AnanthaP ... I would simply alter the original code to give you your final output. Just think of what a final file would look like and then reverse the process till you get back
to the start.

I guess at the end of the day, it will boil down to how many times you have to do this and how often any of the parameters change, as to whether or not you should create a single script / program.

atjurhs 06-27-2013 11:10 AM

yikes! guys, that looks way too scary for me. I have to take little steps and add them together, but I have made a little progress....

after running grail's awk script, then from the system prompt I can run

Code:

cat d_block_* > one_giant_padded_file.txt
and I get what I want for my next step, but I can't find the right syntax to run it inside of grail's awk script. Is there a way to run the cat command inside the awk script?

I thought maybe
Code:


{
cat d_block_* > one_giant_padded_file.txt)
}

but this has syntax errors about the redirect >

so I tried using the system command
Code:


{
system("cat" d_block_* > giant_padded_file.txt)
}

I still get syntax errors on the redirect > I'm not sure how to fix???
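
For reference, awk's system() wants the whole shell command, redirect included, inside a single quoted string, so doing this from inside an awk script would look more like:
Code:

END { system("cat d_block_* > giant_padded_file.txt") }
though, as grail notes further down, running cat separately from the shell is simpler.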

so I could keep working on this, I made the giant_padded_file.txt

and to keep working in the awk world, I wrote
Code:

awk '{print > ("file_" i ".dat")} !(NR%210) {i++}' i=1 giant_padded_file.txt
which does do the next step in the script

so now I'd like to move every group of 10 files into its own directory. I know there will be 38 directories so I could create them beforehand or have the script do it. Moving the files is what I'll work on next...

like cat, could mv also be done as a system command? That might be easier for moving the 10 files at a time???

grail 06-27-2013 11:49 AM

If you are going to use cat then do it separately. You can call a command from inside awk but this would not serve much purpose, as the next step would then be another shell command.

atjurhs 06-27-2013 12:03 PM

Hi grail,

I guess I should find a way to cat the 3800 files (produced by your script) using awk, and that would be better for when I add in the
Code:

awk '{print > ("file_" i ".dat")} !(NR%210) {i++}' i=1 giant_padded_file.txt
part of the script

yes?

atjurhs 06-27-2013 01:50 PM

so now I know I can effectively "cat" the files in awk with

Code:

awk '{print $0}' d_block_* > one_giant_padded_file.txt
but I can't figure out how to add that line of awk to grail's script (I've been trying, with lots of goofy output).
Code:

#!/usr/bin/awk -f

BEGIN{
        end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

        cnt = 0
        file_name = "d_block_"

        end = sprintf("%04d.dat",cnt++)
}

{ print $0,end_zeroes > file_name end }

!(NR % 6){

        for(i = 0; i < 15; i++)
                print extra_zeroes > file_name end

        end = sprintf("%04d.dat",cnt++)
}

???

I need to learn how to add a line of awk to the end of grail's script so that later on I can add my own
Code:

awk '{print > ("file_" i ".dat")} !(NR%210) {i++}' i=1 giant_padded_file.txt
this step by step is probably not the prettiest way to do it, but I'm learning in little bites

grail 06-27-2013 07:00 PM

hmmm ... need to step back a bit. The extra line you have written is an entirely new awk script and cannot simply be attached to the previous one.

If I understand correctly, you are simply grabbing all the output files created by my awk and then catting them into a new single file .... yes?

If yes, then the original script is now easier :) Instead of outputting to multiple files, remove the redirect from inside the script and simply output all data to your new big file.
Code:

#!/usr/bin/awk -f

BEGIN{
        end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
}

{ print $0,end_zeroes }

!(NR % 6){

        for(i = 0; i < 15; i++)
                print extra_zeroes
}

And call it like:
Code:

./script.awk 22800_by_5_matrix.txt > one_giant_padded_file.txt
However, as you are now looking at the next part, which is to once again split the data every 210 lines, this could now be added to the above so that it changes file names every 210 lines ... I'll let you work out how :)
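
For anyone following along later, one way that last step might look; this is a sketch rather than grail's intended answer, and the file_N.dat names are just placeholders mirroring the earlier one-liner: count the lines actually written and start a new output file every 210 of them.
Code:

#!/usr/bin/awk -f

BEGIN{
        end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

        fcnt = 1
        out = "file_" fcnt ".dat"
}

# write one line, and switch to the next file after every 210 lines (10 padded blocks)
function emit(line) {
        print line > out
        if (++lines % 210 == 0) {
                close(out)
                out = "file_" (++fcnt) ".dat"
        }
}

{ emit($0 " " end_zeroes) }

!(NR % 6){

        for(i = 0; i < 15; i++)
                emit(extra_zeroes)
}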

