reformatting a giant matrix
Hi boys,
I have a file.txt that has 22800 rows of data by 5 columns across. The first thing I do with it is to split it up into six line pieces with
Code:
split -d -l 6 file.txt
then I put them back together (reordered) four at a time with
Code:
paste x001 x002 x003 x004 > six_by_four_001.txt
paste x005 x006 x007 x008 > six_by_four_002.txt
paste x009 x010 x011 x012 > six_by_four_003.txt
paste x013 x014 x015 x016 > six_by_four_004.txt
paste x017 x018 x019 x020 > six_by_four_005.txt
paste x021 x022 x023 x024 > six_by_four_006.txt
paste x025 x026 x027 x028 > six_by_four_007.txt
paste x029 x030 x031 x032 > six_by_four_008.txt
paste x033 x034 x035 x036 > six_by_four_009.txt
paste x037 x038 x039 x040 > six_by_four_010.txt
Code:
cat six_by_four_001.txt six_by_four_002.txt six_by_four_003.txt six_by_four_004.txt six_by_four_005.txt six_by_four_006.txt six_by_four_007.txt six_by_four_008.txt six_by_four_009.txt six_by_four_010.txt > sixty_by_twenty.txt
I'd really like some help automating it...... Thanks so much Tabitha! |
Hi girls,
can you tell us what you are actually trying to achieve? With that information we should be able to suggest an efficient approach. Evo2. |
I wish that were easier to say :( I'll try to describe in general:
I have this giant data file with all the data in a single matrix format, and I need to have the data re-organized into a different format. The format I need has the elements in different locations and the giant file broken into multiple smaller files based on other criteria. Sorry, it's really hard to explain, but I know the steps I have so far are correct, just a lot of typing, and then I'll have to do it again and again :( that's why I came asking for help. Tabby |
Hi
it would probably be easiest to do this in a language like python, but it could also be automated using the tools you are already using. Eg a quick and dirty script something like:
Code:
#!/bin/bash
PS. I'm having to guess here that this is what you actually want since what you have presented is incomplete... |
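A minimal sketch of how the split / paste / cat steps from the first post could be looped. This is not evo2's script (only its shebang line appears above), just one possible reading of the steps. Assumptions: GNU split is available, the pieces are taken in order starting from the very first one, and the demo input, the temporary directory and the four-digit suffix (`-a 4`) are illustrative choices rather than anything stated in the thread.

```shell
#!/bin/bash
# Sketch only: loop over the 6-line pieces and paste them four at a time.
set -euo pipefail

# Demo input (assumption): 240 rows x 5 columns standing in for file.txt.
workdir=$(mktemp -d)
cd "$workdir"
awk 'BEGIN { for (r = 1; r <= 240; r++) print r, r, r, r, r }' > file.txt

# -a 4 gives four-digit suffixes (x0000, x0001, ...), enough for 3800 pieces.
split -d -a 4 -l 6 file.txt x

# The glob expands in sorted order, so the pieces stay in file order.
pieces=(x[0-9][0-9][0-9][0-9])

# Paste four pieces side by side and append each 6-line result to the output.
: > sixty_by_twenty.txt
for ((i = 0; i < ${#pieces[@]}; i += 4)); do
    paste "${pieces[@]:i:4}" >> sixty_by_twenty.txt
done
```

With the 240-row demo input this yields ten groups of four pieces, so the output has 60 lines of 20 values each. On the real 22800-row file the same loop would simply keep going past the tenth group.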
Given the size of data and the manipulations involved, definitely write a program as mentioned above; my vote goes to Perl.
This would be able to do all you want in one program/pass and do it very quickly. In case you wondered, Perl is compiled on the fly, not interpreted like eg bash. It has lots of modules you may find useful and an 'extension' called PDL (Perl Data Language), written especially for this kind of problem. http://perldoc.perl.org/ |
good morning guys!
I kinda misunderstood what I am supposed to do, super sorry! So what I am supposed to do is to take the 22800_by_5_matrix.txt file and do a bunch of chopping and a bunch of zero padding, and stick parts of it back together. Here's what I've done so far, and I'm told the output is right for one "d_block_1" of data. I'll have to create 3800 of these.
create a text file that has a matrix of 6 rows of zeros and 15 columns of zeros. I'll call this 6_by_15_block_of_zeros.txt; this file will get used many many times
create another text file that has a matrix of 15 rows of zeros and 20 columns of zeros. I'll call this 15_by_20_block_of_zeros.txt; this file will also get used many many times
Code:
using evo2's little piece of code I came up with (and I know this isn't right, it doesn't run), but I'm trying...
#!/bin/bash
split -d -l 6 22800_by_5_matrix.txt
for i in {0..3799} ; do
a=$(printf 'x%04d' $(i+1))
out=$(printf 'd_block_%04d.txt' $i)
paste $a 6_by_15_block_of_zeros.txt > $6_by_20_padded_block_of_data.txt
cat $6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > $out
done
thanks guys so much for your help!!! Tabby |
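For comparison, a hedged sketch of what a runnable version of that loop might look like. The likely trouble spots in the attempt are `$(i+1)` (command substitution rather than arithmetic, which would be `$((i+1))`), the `$6_by_20...` names (the shell reads `$6` as a positional parameter), and `split -d` producing two-digit suffixes by default while the loop expects four. Assumptions here: GNU split, a small demo matrix instead of the real file, a helper called `zeros` that is made up for this example, and output numbering that starts at d_block_0001.

```shell
#!/bin/bash
# Sketch only: pad each 6 x 5 piece out to 21 x 20 with zeros.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Demo input (assumption): 18 rows x 5 columns instead of 22800 x 5.
awk 'BEGIN { for (r = 1; r <= 18; r++) print r, r, r, r, r }' > demo_matrix.txt

# zeros ROWS COLS: print a ROWS x COLS block of zeros (hypothetical helper).
zeros() {
    awk -v rows="$1" -v cols="$2" 'BEGIN {
        for (r = 1; r <= rows; r++) {
            line = "0"
            for (c = 2; c <= cols; c++) line = line " 0"
            print line
        }
    }'
}
zeros 6 15  > 6_by_15_block_of_zeros.txt
zeros 15 20 > 15_by_20_block_of_zeros.txt

# Four-digit suffixes so the names still sort correctly with 3800 pieces.
split -d -a 4 -l 6 demo_matrix.txt x

n=0
for piece in x[0-9][0-9][0-9][0-9]; do
    n=$((n + 1))
    out=$(printf 'd_block_%04d.txt' "$n")
    # Zeros to the right (6 x 20), then the 15 x 20 zero block underneath.
    paste -d ' ' "$piece" 6_by_15_block_of_zeros.txt > padded.tmp
    cat padded.tmp 15_by_20_block_of_zeros.txt > "$out"
done
rm -f padded.tmp
```

Each d_block file then has 21 rows of 20 values: six data rows padded on the right, followed by fifteen rows of zeros.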
hmmm ... well not sure how fast it will run or even if i have the whole picture yet, but maybe something like:
Code:
#!/usr/bin/awk -f
Code:
./script.awk 22800_by_5_matrix.txt |
aaaah, you've come to my rescue again :)
after running your script (that works perfectly btw) I then execute
Code:
Code:
x000 x001 x002... x379
The last thing I have to do is rename the files. I think I can do this with move and a for loop, but the 1st group of ten files needs to be called:
data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
and then the 2nd group of ten files needs to be called:
data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
etc. etc. etc. all the way to the last group of ten files, called:
data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
So I know the only way to do this and not overwrite files is to put them in separate folders. So I have folders f1 f2 f3... f38, and each of the folders gets a group of ten files moved into it. So the two tasks I have left are
1) moving groups of 10 files into their proper folder
2) batch renaming the files, incrementing a part of the name
So that's what I'm going to work on, and I'll let you know when I get stuck :( thanks so much, Tabby |
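The two remaining tasks (grouping by ten and renaming) can be done in one loop. A rough sketch, assuming the files are named x000, x001, ... as listed above, that the folders are f1, f2, ..., and that the numbers in the new names always run 13, 16, ..., 40 within each folder. The demo creates only 20 small files in a temporary directory; everything about the demo data is invented for illustration.

```shell
#!/bin/bash
# Sketch only: move files into folders of ten and rename them on the way.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Demo data (assumption): 20 small files x000..x019 instead of the full set.
for i in $(seq 0 19); do
    printf 'piece %d\n' "$i" > "$(printf 'x%03d' "$i")"
done

i=0
for f in x[0-9][0-9][0-9]; do
    dir="f$(( i / 10 + 1 ))"        # f1 for files 0-9, f2 for 10-19, ...
    num=$(( 13 + 3 * (i % 10) ))    # 13, 16, 19, ..., 40 inside each folder
    mkdir -p "$dir"
    mv -- "$f" "$dir/data_${num}.dat"
    i=$(( i + 1 ))
done
```

Integer division picks the folder and the remainder picks the position inside it, so no file is ever moved onto an existing name.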
In awk (pseudo code)
In the BEGIN block, define an array of 6 x 20. In the regular block:
1. LET r = ((line_no - 1) mod 24) + 1, so that r runs from 1 to 24.
2. If r = 1 then initialize the array, i.e. on the 1st, 25th, 49th etc. record.
3. If r < 7, then write $1 to $5 of the row to the array elements (r,1) to (r,5); break;
4. If r < 13, then write $1 to $5 of the row to the array elements (r-6,6) to (r-6,10); break;
5. If r < 19, then write $1 to $5 of the row to the array elements (r-12,11) to (r-12,15); break;
6. If r < 25, then write $1 to $5 of the row to the array elements (r-18,16) to (r-18,20); break;
7. When the array is full, i.e. when r = 24, write the array.
This should do it. OK |
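One way the pseudo code could be turned into real awk, as a sketch rather than AnanthaP's own implementation. It assumes every input line has exactly five whitespace-separated fields, takes r as the 1-based position within each group of 24 lines, and computes the target row and column offset arithmetically instead of with four separate branches. The demo file name and its contents are made up.

```shell
#!/bin/bash
# Sketch only: gather each group of 24 input lines into a 6 x 20 block.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Demo input (assumption): 48 rows x 5 columns, i.e. two groups of 24.
awk 'BEGIN { for (r = 1; r <= 48; r++) print r, r, r, r, r }' > demo_matrix.txt

awk '
{
    r   = (NR - 1) % 24 + 1          # position within the current group, 1..24
    row = (r - 1) % 6 + 1            # target row in the 6 x 20 array, 1..6
    off = int((r - 1) / 6) * 5       # column offset: 0, 5, 10 or 15
    for (c = 1; c <= 5; c++) cell[row, off + c] = $c
}
r == 24 {                            # array is full: write it out row by row
    for (i = 1; i <= 6; i++) {
        line = cell[i, 1]
        for (j = 2; j <= 20; j++) line = line " " cell[i, j]
        print line
    }
}' demo_matrix.txt > six_by_twenty_blocks.txt
```

Every cell is overwritten on each pass, so no explicit re-initialisation is needed. A trailing group with fewer than 24 lines is silently dropped in this version.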
I am with AnanthaP ... I would simply alter the original code to give you your final output. Just think of what a final file would look like and then reverse the process till you get back
to the start. I guess at the end of the day, it will boil down to how many times you have to do this and how often any of the parameters change, as to whether or not you should create a single script / program. |
yikes! guys, that looks way too scary for me. I have to take little steps and add them together, but I have made a little progress....
after running grail's awk script, then from the system prompt I can run
Code:
cat d_block_* > one_giant_padded_file.txt
I thought maybe
Code:
so I tried using the system command
Code:
so I could keep working on this, I made the giant_padded_file.txt and, to keep working in the awk world, I wrote
Code:
awk '!(NR%210) {i++;} {print > "file_"i".dat";}' i=1 giant_padded_file.txt
so now I'd like to move every group of 10 files into its own directory. I know there will be 38 directories, so I could create them beforehand or have the script do it. Moving the files is what I'll work on next... like cat, could mv also be done as a system command? That might be easier for moving the 10 files at a time???
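A note on the chunking one-liner: as written, the counter is bumped on line 210 itself, before that line is printed, so the first output file would get 209 lines and every later file would start one line early. If the intent is exactly 210 lines per file, a sketch like the following may be closer. The `size` variable and the demo input are assumptions for illustration, and `close()` is there so that awk does not hold every output file open at once.

```shell
#!/bin/bash
# Sketch only: write every 210 input lines to its own numbered file.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Demo input (assumption): 630 numbered lines, i.e. three chunks of 210.
seq 1 630 > giant_padded_file.txt

awk -v size=210 '
(NR - 1) % size == 0 {               # first line of each chunk
    if (out != "") close(out)        # release the previous output file
    out = sprintf("file_%d.dat", ++i)
}
{ print > out }
' giant_padded_file.txt
```

Building the name with sprintf into a variable also sidesteps the unparenthesised concatenation after `>`, which some awk implementations parse differently.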
If you are going to use cat then do it separately. You can call commands from inside awk, but this would not serve much purpose as the next step would then be another shell command.
|
Hi grail,
I guess I should find a way in awk to cat the 380 files (produced by your script) using awk? and that would be better for when I add in the
Code:
awk '!(NR%210) {i++;} {print > "file_"i".dat";}' i=1 giant_padded_file.txt
yes? |
so now I know I can effectively "cat" the files in awk with
Code:
awk '{print $0}' d_block_* > one_giant_padded_file.txt
Code:
#!/usr/bin/awk -f
Code:
awk '!(NR%210) {i++;} {print > "file_"i".dat";}' i=1 giant_padded_file.txt |
hmmm ... need to step back a bit. The extra you have done is an entirely new awk script and cannot be simply attached to the previous.
If I understand correctly, you are simply grabbing all the output files created by my awk and then catting them into a new single file .... yes? If yes, then the original script is now easier :) Instead of outputting to multiple files, remove the redirect from inside the script and simply output all data to your new big file. Code:
#!/usr/bin/awk -f
Code:
./script.awk 22800_by_5_matrix.txt > one_giant_padded_file.txt |
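The body of grail's script is not shown above, so the following is only a guess at the general idea: pad each 6-row block in a single awk pass and send everything to standard output, so that no intermediate d_block files or cat step are needed. It assumes exactly five space-separated values per input row with no trailing whitespace. The variable names (`rows`, `datacols`, `padcols`, `padrows`) and the demo input are invented for this sketch.

```shell
#!/bin/bash
# Sketch only: zero-pad every 6-row block and write one combined file.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Demo input (assumption): 12 rows x 5 columns, i.e. two blocks of 6 rows.
awk 'BEGIN { for (r = 1; r <= 12; r++) print r, r, r, r, r }' > demo_matrix.txt

awk -v rows=6 -v datacols=5 -v padcols=15 -v padrows=15 '
BEGIN {
    for (c = 1; c <= padcols; c++) right = right " 0"   # zeros appended to each data row
    zrow = "0"
    for (c = 2; c <= datacols + padcols; c++) zrow = zrow " 0"   # one full row of zeros
}
{ print $0 right }                                      # data row padded to 20 columns
NR % rows == 0 {                                        # end of a block: add the zero rows
    for (k = 1; k <= padrows; k++) print zrow
}
' demo_matrix.txt > one_giant_padded_file.txt
```

Each block then occupies 21 lines of 20 values in the combined file, which matches the 210-line chunks (ten blocks) used in the splitting step earlier in the thread.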