 LinuxQuestions.org reformatting a giant matrix
 06-25-2013, 05:11 PM #1 atjurhs Member   Registered: Aug 2012 Posts: 185 Rep:
reformatting a giant matrix

Hi boys,

I have a file.txt that has 22800 rows of data by 5 columns. The first thing I do with it is split it up into six-line pieces with

Code:
split -d -l 6 file.txt

and that gets me 3800 files that are all matrices of 6 rows by 5 columns. Then I put them back together (reordered) four at a time with

Code:
paste x001 x002 x003 x004 > six_by_four_001.txt
paste x005 x006 x007 x008 > six_by_four_002.txt
paste x009 x010 x011 x012 > six_by_four_003.txt
paste x013 x014 x015 x016 > six_by_four_004.txt
paste x017 x018 x019 x020 > six_by_four_005.txt
paste x021 x022 x023 x024 > six_by_four_006.txt
paste x025 x026 x027 x028 > six_by_four_007.txt
paste x029 x030 x031 x032 > six_by_four_008.txt
paste x033 x034 x035 x036 > six_by_four_009.txt
paste x037 x038 x039 x040 > six_by_four_010.txt

This gets me ten files that have 6 rows and 20 columns. Then I cat those together with

Code:
cat six_by_four_001.txt six_by_four_002.txt six_by_four_003.txt six_by_four_004.txt six_by_four_005.txt six_by_four_006.txt six_by_four_007.txt six_by_four_008.txt six_by_four_009.txt six_by_four_010.txt > sixty_by_twenty.txt

As you can see this is a lot of "by hand" and I haven't gotten through all the 3800 files, yikes!!!! I'd really like some help automating it......

Thanks so much
Tabitha!
 06-25-2013, 05:15 PM #2 evo2 LQ Guru   Registered: Jan 2009 Location: Japan Distribution: Mostly Debian and Scientific Linux Posts: 5,753 Rep: Hi girls, can you tell us what you are actually trying to achieve? With that information we should be able to suggest an efficient approach. Evo2.
 06-25-2013, 05:27 PM #3 atjurhs Member   Registered: Aug 2012 Posts: 185 Original Poster Rep:
I wish that were easier to say. I'll try to describe it in general: I have this giant data file with all the data in a single matrix format, and I need to have the data re-organized into a different format. The format I need has the elements in different locations and the giant file broken into multiple smaller files based on other criteria.

Sorry, it's really hard to explain, but I know the steps I have so far are correct. It's just a lot of typing, and then I'll have to do it again and again. That's why I came asking for help.

Tabby
 06-25-2013, 05:44 PM #4 evo2 LQ Guru   Registered: Jan 2009 Location: Japan Distribution: Mostly Debian and Scientific Linux Posts: 5,753 Rep:
Hi, it would probably be easiest to do this in a language like python, but it could also be automated using the tools you are already using. Eg a quick and dirty script something like:

Code:
#!/bin/bash
# -a 4 forces four-digit suffixes, numbered from 0: x0000 ... x3799
split -d -a 4 -l 6 file.txt
# 3800 pieces taken four at a time gives 950 groups
for i in {0..949} ; do
  a=$(printf 'x%04d' $((i*4)) )
  b=$(printf 'x%04d' $((i*4+1)) )
  c=$(printf 'x%04d' $((i*4+2)) )
  d=$(printf 'x%04d' $((i*4+3)) )
  out=$(printf 'six_by_four_%04d.txt' $i)
  paste $a $b $c $d > $out
done
cat six_by_four_*.txt > sixty_by_twenty.txt
\rm six_by_four_*.txt

Evo2.

PS. I'm having to guess here that this is what you actually want, since what you have presented is incomplete...

Last edited by evo2; 06-25-2013 at 05:51 PM.

 06-25-2013, 06:12 PM #5 chrism01 LQ Guru Registered: Aug 2004 Location: Sydney Distribution: Centos 6.8, Centos 5.10 Posts: 17,295 Rep:
Given the size of data and the manipulations involved, definitely write a program as mentioned above; my vote goes to Perl. This would be able to do all you want in one program/pass and do it very quickly. In case you wondered, Perl is compiled on the fly, not interpreted like eg bash. It has lots of modules you may find useful and an 'extension' called PDL = Perl Data Lang, written especially for this kind of problem.
http://perldoc.perl.org/

 06-26-2013, 11:24 AM #6 atjurhs Member Registered: Aug 2012 Posts: 185 Original Poster Rep:
good morning guys!

I kinda misunderstood what I am supposed to do, super sorry! So what I am supposed to do is take the 22800_by_5_matrix.txt file, do a bunch of chopping and a bunch of zero padding, and stick parts of it back together. Here's what I've done so far, and I'm told the output is right for one "d_block_1" of data. I'll have to create 3800 of these.

Create a text file that has a matrix of 6 rows by 15 columns of zeros. I'll call this 6_by_15_block_of_zeros.txt, and this file will get used many many times.
Create another text file that has a matrix of 15 rows by 20 columns of zeros. I'll call this 15_by_20_block_of_zeros.txt, and this file will also get used many many times.

Code:
split -d -l 6 22800_by_5_file.txt
# this creates 3800 files named x00, x01, x02, x03... x3800 that are 6_by_5_matrices
paste x00 6_by_15_block_of_zeros.txt > 6_by_20_padded_block_of_data.txt
# I think this output is a tmp file which can be overwritten each time this gets looped thru
cat 6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > d_block_00.dat

This creates one 21_by_20 matrix of data for the x00 file, with the upper left 6_by_5 elements having data and the rest of the elements being zeros, so now I need to loop over this 3800 times.

Using evo2's little piece of code I came up with this (and I know this isn't right, it doesn't run), but I'm trying...

Code:
#!/bin/bash
split -d -l 6 22800_by_5_matrix.txt
for i in {0..3799} ; do
a=$(printf 'x%04d' $(i+1))
out=$(printf 'd_block_%04d.txt' $i)
paste $a 6_by_15_block_of_zeros.txt > $6_by_20_padded_block_of_data.txt
cat $6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > $out
done

thanks guys so much for your help!!!

Tabby

 06-26-2013, 11:58 AM #7 grail LQ Guru Registered: Sep 2009 Location: Perth Distribution: Manjaro Posts: 9,437 Rep:
hmmm ... well, not sure how fast it will run or even if I have the whole picture yet, but maybe something like:

Code:
#!/usr/bin/awk -f

BEGIN{
    end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    cnt = 0
    file_name = "d_block_"
    # four digits, so that a later shell glob d_block_* sorts in numeric order
    end = sprintf("%04d.dat",cnt++)
}

# pad every 5-column data row out to 20 columns
{ print $0, end_zeroes > (file_name end) }

# after every 6th row: add 15 rows of zeros, then switch to the next file
!(NR % 6){
    for(i = 0; i < 15; i++)
        print extra_zeroes > (file_name end)
    close(file_name end)
    end = sprintf("%04d.dat",cnt++)
}

And you would run it like:

Code:
./script.awk 22800_by_5_matrix.txt
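For comparison with the awk version, here is a rough sketch of how the bash attempt from post #6 could be made to run. It is only one possible repair, not what anyone in the thread posted. The fixes are: `$(i+1)` tries to run a command called `i+1`, so arithmetic needs `$((...))`; bash reads `$6_by_20_...` as the positional parameter `$6` followed by text, so the file names are written without a leading `$`; and `split -d` numbers its pieces from 0. The scratch directory, the 24-row `matrix.txt` stand-in, the `zeros` helper and `padded.tmp` are assumptions made up for the demo.

```shell
#!/bin/bash
# Demo setup (assumption): work in a scratch directory on a small 24-row,
# 5-column stand-in for 22800_by_5_matrix.txt.
cd "$(mktemp -d)" || exit 1
for r in $(seq 1 24); do echo "$r $r $r $r $r"; done > matrix.txt

# Hypothetical helper: print ROWS lines of COLS space-separated zeros.
zeros() {
  awk -v r="$1" -v c="$2" 'BEGIN {
    for (i = 1; i <= r; i++) {
      line = "0"
      for (j = 2; j <= c; j++) line = line " 0"
      print line
    }
  }'
}
zeros 6 15  > 6_by_15_block_of_zeros.txt
zeros 15 20 > 15_by_20_block_of_zeros.txt

# -a 4 forces four-digit suffixes: x0000, x0001, ... (numbered from 0).
split -d -a 4 -l 6 matrix.txt x

n=0
for piece in x[0-9][0-9][0-9][0-9]; do
  out=$(printf 'd_block_%04d.dat' "$n")
  # paste joins with tabs by default; -d ' ' keeps everything space separated.
  paste -d ' ' "$piece" 6_by_15_block_of_zeros.txt > padded.tmp
  cat padded.tmp 15_by_20_block_of_zeros.txt > "$out"
  n=$((n + 1))
done

# Clean up the temporary pieces.
rm -f padded.tmp x[0-9][0-9][0-9][0-9]
```

Looping over the glob instead of over `{0..3799}` means the loop does not need to know in advance how many pieces `split` produced.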
 06-26-2013, 02:29 PM #8 atjurhs Member   Registered: Aug 2012 Posts: 185 Original Poster Rep:
aaaah, you've come to my rescue again. After running your script (that works perfectly btw) I then execute

Code:
cat d_block_* > one_big_padded_file.dat
split -d -a 3 -l 210 one_big_padded_file.dat

and I get out 380 files named

Code:
x000 x001 x002... x379

perfect, and each of them has 210 rows by 20 columns of data per file, double perfect.

The last thing I have to do is rename the files. I think I can do this with move and a for loop, but the 1st group of ten files needs to be called:
data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat

and then the 2nd group of ten files needs to be called:
data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat

etc. etc. etc. all the way to the last group of ten files, called:
data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat

So I know the only way to do this and not overwrite files is to put them in separate folders. So I have folders f1 f2 f3... f38, and then each of the folders gets a group of ten files moved into it.

So the two tasks I have left are
1) moving groups of 10 files into their proper folder
2) batch renaming the files, incrementing a part of the name

So that's what I'm going to work on, and I'll let you know when I get stuck.

thanks so much,
Tabby
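The two remaining tasks can be done in one loop, since both the folder and the new name follow from a file's position in the sequence. The sketch below assumes the pieces are taken in name order (x000 first), that piece number k goes to folder f(k/10 + 1), and that the number in the name is 13 + 3 × (k mod 10), which reproduces the 13, 16, ... 40 pattern described above. The scratch directory and the 20 stand-in files are demo assumptions only.

```shell
#!/bin/bash
# Demo setup (assumption): 20 small stand-ins for x000 ... x379
# in a scratch directory.
cd "$(mktemp -d)" || exit 1
for k in $(seq 0 19); do
  printf 'piece %d\n' "$k" > "$(printf 'x%03d' "$k")"
done

k=0
for f in x[0-9][0-9][0-9]; do       # fixed-width names, so glob order is numeric order
  dir="f$(( k / 10 + 1 ))"          # x000-x009 -> f1, x010-x019 -> f2, ...
  num=$(( 13 + 3 * (k % 10) ))      # 13, 16, 19, ... 40 within each folder
  mkdir -p "$dir"
  mv "$f" "$dir/data_$num.dat"
  k=$(( k + 1 ))
done
```

`mkdir -p` creates each folder on first use and does nothing if it already exists, so the 38 folders do not have to be made beforehand.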
 06-27-2013, 03:24 AM #9 AnanthaP Member   Registered: Jul 2004 Location: Chennai, India Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC Posts: 820 Rep:
In awk (pseudo code)

In the BEGIN block, define an array of 6 x 20.
In the regular block:
1. LET r = ((line_no - 1) mod 24) + 1, so that r runs from 1 to 24.
2. If r = 1, then initialize the array, i.e. on the 1st, 25th, 49th etc. record.
3. If r < 7, then write $1 to $5 of the row to the array elements (r,1) to (r,5); break;
4. If r < 13, then write $1 to $5 of the row to the array elements (r-6,6) to (r-6,10); break;
5. If r < 19, then write $1 to $5 of the row to the array elements (r-12,11) to (r-12,15); break;
6. If r < 25, then write $1 to $5 of the row to the array elements (r-18,16) to (r-18,20); break;
7. When the array is full, i.e. when r = 24, write the array.

This should do it.

OK
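The pseudo code above could be turned into real awk roughly as follows. This is a sketch of the idea rather than a tested solution for the real data: it collapses the four range checks into two small formulas (which row, which group of five columns), and it skips the explicit initialization because every cell is overwritten in each group of 24 lines. The scratch directory, the 48-row demo `file.txt`, the `cell` array and the output name `reshaped.txt` are assumptions for illustration.

```shell
#!/bin/bash
# Demo setup (assumption): 48 rows of 5 columns, each value = its row number.
cd "$(mktemp -d)" || exit 1
awk 'BEGIN { for (r = 1; r <= 48; r++) print r, r, r, r, r }' > file.txt

awk '
{
  r     = (NR - 1) % 24 + 1        # position inside the current group of 24 lines: 1..24
  row   = (r - 1) % 6 + 1          # target row in the 6 x 20 array: 1..6
  block = int((r - 1) / 6)         # which six-line piece this is: 0..3
  for (c = 1; c <= 5; c++)
    cell[row, block * 5 + c] = $c  # columns 1-5, 6-10, 11-15 or 16-20
}
r == 24 {                          # array is full: write out 6 rows of 20 columns
  for (i = 1; i <= 6; i++) {
    line = cell[i, 1]
    for (j = 2; j <= 20; j++) line = line " " cell[i, j]
    print line
  }
}' file.txt > reshaped.txt
```

With the demo input, the first output row is built from input lines 1, 7, 13 and 19, which is the same arrangement that `paste x001 x002 x003 x004` produces, only without any intermediate files.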
 06-27-2013, 07:38 AM #10 grail LQ Guru   Registered: Sep 2009 Location: Perth Distribution: Manjaro Posts: 9,437 Rep: I am with AnanthaP ... I would simply alter the original code to give you your final output. Just think of what a final file would look like and then reverse the process till you get back to the start. I guess at the end of the day, it will boil down to how many times you have to do this and how often any of the parameters change, as to whether or not you should create a single script / program.
 06-27-2013, 11:10 AM #11 atjurhs Member   Registered: Aug 2012 Posts: 185 Original Poster Rep:
yikes! guys, that looks way too scary for me. I have to take little steps and add them together, but I have made a little progress....

After running grail's awk script, from the system prompt I can run

Code:
cat d_block_* > one_giant_padded_file.txt

and I get what I want for my next step, but I can't find the right syntax to run it inside of grail's awk script. Is there a way to run the cat command inside the awk script? I thought maybe

Code:
{ cat d_block_* > one_giant_padded_file.txt) }

but this has syntax errors about the redirect >, so I tried using the system command

Code:
{ system("cat" d_block_* > giant_padded_file.txt) }

I still get syntax errors on the redirect >, and I'm not sure how to fix it???

So that I could keep working on this, I made the giant_padded_file.txt by hand, and to keep working in the awk world, I wrote

Code:
awk '{print > ("file_" i ".dat")} !(NR%210) {i++}' i=1 giant_padded_file.txt

which does do the next step in the script.

So now I'd like to move every group of 10 files into its own directory. I know there will be 38 directories, so I could create them beforehand or have the script do it. Moving the files is what I'll work on next... Like cat, could mv also be done as a system command? That might be easier for moving the 10 files at a time???
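On the syntax errors: awk's `system()` takes a single string and hands it to the shell, so the whole command line, including the glob and the `>` redirect, has to sit inside the quotes. Outside the quotes, awk tries to parse `d_block_*` and `>` as its own expressions. A minimal sketch, with two made-up one-line files standing in for the real d_block files and a scratch directory assumed for the demo:

```shell
#!/bin/bash
# Demo setup (assumption): two tiny stand-ins for the d_block files.
cd "$(mktemp -d)" || exit 1
printf 'a\n' > d_block_0000.dat
printf 'b\n' > d_block_0001.dat

awk 'BEGIN {
  # The complete shell command is one quoted string; the shell, not awk,
  # expands the glob and performs the redirect.
  status = system("cat d_block_*.dat > one_giant_padded_file.txt")
  print "cat exited with status", status
}'
```

The same pattern works for `mv`. If the awk script itself wrote the files being read, it should `close()` them before calling `system()`, otherwise buffered output may not be on disk yet.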
 06-27-2013, 11:49 AM #12 grail LQ Guru   Registered: Sep 2009 Location: Perth Distribution: Manjaro Posts: 9,437 Rep:
If you are going to use cat then do it separately. You can call commands from inside awk, but this would not serve much purpose as the next step would then be another shell command.
 06-27-2013, 12:03 PM #13 atjurhs Member   Registered: Aug 2012 Posts: 185 Original Poster Rep:
Hi grail,

I guess I should find a way to cat the 380 files (produced by your script) using awk? And that would be better for when I add in the

Code:
awk '{print > ("file_" i ".dat")} !(NR%210) {i++}' i=1 giant_padded_file.txt

part of the script, yes?
 06-27-2013, 01:50 PM #14 atjurhs Member   Registered: Aug 2012 Posts: 185 Original Poster Rep:
so now I know I can effectively "cat" the files in awk with

Code:
awk '{print $0}' d_block_* > one_giant_padded_file.txt

but I can't figure out how to add that line of awk to grail's script (I've been trying, with lots of goofy output).

Code:
#!/usr/bin/awk -f

BEGIN{
    end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    cnt = 0
    file_name = "d_block_"
    # four digits, so that a later shell glob d_block_* sorts in numeric order
    end = sprintf("%04d.dat",cnt++)
}

{ print $0, end_zeroes > (file_name end) }

!(NR % 6){
    for(i = 0; i < 15; i++)
        print extra_zeroes > (file_name end)
    close(file_name end)
    end = sprintf("%04d.dat",cnt++)
}

???

I need to learn how to add a line of awk to the end of grail's script so that later on I can add my own

Code:
awk '{print > ("file_" i ".dat")} !(NR%210) {i++}' i=1 giant_padded_file.txt

This step by step is probably not the prettiest way to do it, but I'm learning with little bites.

Last edited by atjurhs; 06-27-2013 at 01:54 PM.
 06-27-2013, 07:00 PM #15 grail LQ Guru   Registered: Sep 2009 Location: Perth Distribution: Manjaro Posts: 9,437 Rep:
hmmm ... need to step back a bit. The extra you have done is an entirely new awk script and cannot simply be attached to the previous one.

If I understand correctly, you are simply grabbing all the output files created by my awk and then catting them into a new single file .... yes?

If yes, then the original script is now easier. Instead of outputting to multiple files, remove the redirect from inside the script and simply output all data to your new big file.

Code:
#!/usr/bin/awk -f

BEGIN{
    end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
}

{ print $0, end_zeroes }

!(NR % 6){
    for(i = 0; i < 15; i++)
        print extra_zeroes
}

And call it like:

Code:
./script.awk 22800_by_5_matrix.txt > one_giant_padded_file.txt

However, as you are now looking at the next part, which is to once again split the data every 210 lines, this can now be added to the above to change file names every 210 lines ... I'll let you work out how.
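One possible way to fold the 210-line split into the padding script is to count output rows rather than input rows and switch files whenever the count reaches a multiple of 210. The sketch below is an assumption about how that could look, not grail's own answer: the `emit` helper, the `rows_per_file` variable, the `file_%03d.dat` naming and the 120-row demo input in a scratch directory are all invented for illustration.

```shell
#!/bin/bash
# Demo setup (assumption): 120 rows of 5 columns, i.e. 20 blocks of 6 rows,
# which pad out to 420 rows = two full output files.
cd "$(mktemp -d)" || exit 1
awk 'BEGIN { for (r = 1; r <= 120; r++) print r, r, r, r, r }' > matrix.txt

awk '
BEGIN {
  end_zeroes    = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
  extra_zeroes  = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
  rows_per_file = 210
  n   = 1
  out = sprintf("file_%03d.dat", n)
}
# Hypothetical helper: write one row, and start a new file after every 210 rows.
function emit(line) {
  print line > out
  if (++written % rows_per_file == 0) {
    close(out)
    out = sprintf("file_%03d.dat", ++n)
  }
}
# Pad each data row to 20 columns.
{ emit($0 " " end_zeroes) }
# After every 6th input row, add the 15 rows of zeros.
!(NR % 6) { for (i = 0; i < 15; i++) emit(extra_zeroes) }
' matrix.txt
```

Because 210 is ten times the 21-row block height, every output file ends exactly on a block boundary. The next file name is computed as soon as a file fills up, but no file is created until something is actually printed to it, so there is no empty trailing file.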
