LinuxQuestions.org
Old 06-25-2013, 06:11 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Rep: Reputation: Disabled
reformatting a giant matrix


Hi boys,

I have a file.txt that has 22800 rows of data by 5 columns across

the first thing I do with it is to split it up into six line pieces with

Code:
split -d -l 6 file.txt
and that gets me 3800 files that are all matrices of 6 rows by 5 columns

then I put them back together (reordered) four at a time with

Code:
 paste x001 x002 x003 x004 > six_by_four_001.txt
then
Code:
 paste x005 x006 x007 x008 > six_by_four_002.txt
and
Code:
 paste x009 x010 x011 x012 > six_by_four_003.txt
and
Code:
 paste x013 x014 x015 x016 > six_by_four_004.txt
and
Code:
 paste x017 x018 x019 x020 > six_by_four_005.txt
and
Code:
 paste x021 x022 x023 x024 > six_by_four_006.txt
and
Code:
 paste x025 x026 x027 x028 > six_by_four_007.txt
and
Code:
 paste x029 x030 x031 x032 > six_by_four_008.txt
and
Code:
 paste x033 x034 x035 x036 > six_by_four_009.txt
and
Code:
 paste x037 x038 x039 x040 > six_by_four_010.txt
this gets me ten files that have 6 rows and 20 columns, then I cat those together with

Code:
cat six_by_four_001.txt six_by_four_002.txt six_by_four_003.txt six_by_four_004.txt six_by_four_005.txt six_by_four_006.txt six_by_four_007.txt six_by_four_008.txt six_by_four_009.txt six_by_four_010.txt > sixty_by_twenty.txt
as you can see this is a lot of "by hand" and I haven't gotten through all the 3800 files, yikes!!!!

I'd really like some help automating it......

Thanks so much Tabitha!
 
Old 06-25-2013, 06:15 PM   #2
evo2
LQ Guru
 
Registered: Jan 2009
Location: Japan
Distribution: Mostly Debian and Scientific Linux
Posts: 5,753

Rep: Reputation: 1288
Hi girls,

can you tell us what you are actually trying to achieve? With that information we should be able to suggest an efficient approach.

Evo2.
 
Old 06-25-2013, 06:27 PM   #3
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
I wish that were easier to say. I'll try to describe it in general:

I have this giant data file with all the data in a single matrix format and I need to have the data re-organized into a different format. The format I need has the elements in different locations and the giant file broken into multiple smaller files based on other criteria.

sorry it's really hard to explain, but I know the steps I have so far are correct, just a lot of typing, and then I'll have to do it again and again, that's why I came asking for help.

Tabby
 
Old 06-25-2013, 06:44 PM   #4
evo2
LQ Guru
 
Registered: Jan 2009
Location: Japan
Distribution: Mostly Debian and Scientific Linux
Posts: 5,753

Rep: Reputation: 1288
Hi

it would probably be easiest to do this in a language like python, but it could also be automated using the tools you are already using. Eg a quick and dirty script something like:
Code:
#!/bin/bash
# -a 4 forces four-digit suffixes so the names match the x%04d pattern below
split -d -a 4 -l 6 file.txt
# 3800 six-line pieces, pasted four at a time, give 950 output files
for i in {0..949} ; do
  a=$(printf 'x%04d' $((i*4)) )
  b=$(printf 'x%04d' $((i*4+1)) )
  c=$(printf 'x%04d' $((i*4+2)) )
  d=$(printf 'x%04d' $((i*4+3)) )
  out=$(printf 'six_by_four_%04d.txt' $i)
  paste $a $b $c $d > $out
done
cat six_by_four_*.txt > sixty_by_twenty.txt
\rm six_by_four_*.txt
Evo2.

PS. I'm having to guess here that this is what you actually want since, what you have presented is incomplete...

Last edited by evo2; 06-25-2013 at 06:51 PM. Reason: split output files are numbered from 1, not 0.
 
Old 06-25-2013, 07:12 PM   #5
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
Given the size of data and the manipulations involved, definitely write a program as mentioned above; my vote goes to Perl.
This would be able to do all you want in one program/pass and do it very quickly.
In case you wondered, Perl is compiled on the fly, not interpreted like e.g. bash.
It has lots of modules you may find useful and an 'extension' called PDL (Perl Data Language), written especially for this kind of problem.
http://perldoc.perl.org/
 
Old 06-26-2013, 12:24 PM   #6
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
good morning guys!

I kinda misunderstood what I am supposed to do, super sorry!

so what I am supposed to do is to take the 22800_by_5_matrix.txt file and do a bunch of chopping and a bunch of zero padding, and stick parts of it back together. Here's what I've done so far and I'm told the output is right for one "d_block_1" of data. I'll have to create 3800 of these.

create a text file that has a matrix of 6 rows of zeros and 15 columns of zeros I'll call this 6_by_15_block_of_zeros.txt this file will get used many many times
create another text file that has a matrix of 15 rows of zeros and 20 columns of zeros I'll call this 15_by_20_block_of_zeros.txt and this file will get used many many times

Code:
 
split -d -l 6 22800_by_5_file.txt      # this creates 3800 files named x00, x01, x02, x03... that are 6_by_5_matrices
paste x00 6_by_15_block_of_zeros.txt >  6_by_20_padded_block_of_data.txt  # I think this output is a tmp file which can be overwritten each time this gets looped thru
cat 6_by_20_padded_block_of_data.txt  15_by_20_block_of_zeros.txt >  d_block_00.dat
this creates one 21_by_20 matrix of data for the x00 file with the upper left 6_by_5 elements having data and the rest of the elements being zeros, so now I need to loop over this 3800 times.

using evo2's little piece of code I came up with (and I know this isn't right, it doesn't run), but I'm trying...

#!/bin/bash
split -d -l 6 22800_by_5_matrix.txt
for i in {0..3799} ; do
a=$(printf 'x%04d' $(i+1))
out=$(printf 'd_block_%04d.txt' $i)
paste $a 6_by_15_block_of_zeros.txt > $6_by_20_padded_block_of_data.txt
cat $6_by_20_padded_block_of_data.txt 15_by_20_block_of_zeros.txt > $out
done

thanks guys so much for your help!!!

Tabby
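
For reference, the attempt above fails for three reasons: `$(i+1)` is command substitution rather than arithmetic (`$((i+1))`), `$6_by_20_...` expands the positional parameter `$6`, and plain `split -d` does not produce four-digit names. A minimal sketch of a corrected loop follows, assuming GNU `split` and the two zero-block files described in the post. The function name `make_d_blocks` is made up for illustration.

```shell
#!/bin/sh
# make_d_blocks INPUT ZEROS_6x15 ZEROS_15x20
# Hypothetical helper (not from the thread): splits INPUT into 6-line pieces
# and pads each piece out to 21 rows by 20 columns, writing d_block_NNNN.txt
# into the current directory.
make_d_blocks() {
    input=$1
    side_zeros=$2
    bottom_zeros=$3

    # -a 4 forces four-digit suffixes: x0000, x0001, ... (GNU split).
    split -d -a 4 -l 6 "$input" x

    i=0
    for piece in x[0-9][0-9][0-9][0-9]; do
        out=$(printf 'd_block_%04d.txt' "$i")
        # paste adds the 15 zero columns; cat appends the 15 zero rows.
        paste "$piece" "$side_zeros" | cat - "$bottom_zeros" > "$out"
        rm "$piece"
        i=$((i + 1))
    done
}
```

It would be called as `make_d_blocks 22800_by_5_matrix.txt 6_by_15_block_of_zeros.txt 15_by_20_block_of_zeros.txt`. Piping `paste` straight into `cat` avoids the temporary padded file altogether.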
 
Old 06-26-2013, 12:58 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,245

Rep: Reputation: 2684
hmmm ... well not sure how fast it will run or even if i have the whole picture yet, but maybe something like:
Code:
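#!/usr/bin/awk -f

BEGIN{ 
	end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0" 
	extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

	cnt = 0
	file_name = "d_block_"

	# four digits, so that 3800 file names still sort in numeric order for "cat d_block_*"
	end = sprintf("%04d.dat",cnt++)
}

# pad each data row on the right with 15 zero columns
{ print $0,end_zeroes > (file_name end) }

# after every 6th row: append 15 rows of zeros, close the file, move on to the next name
!(NR % 6){

	for(i = 0; i < 15; i++)
		print extra_zeroes > (file_name end)

	close(file_name end)
	end = sprintf("%04d.dat",cnt++)
}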
And you would run it like:
Code:
./script.awk 22800_by_5_matrix.txt
 
Old 06-26-2013, 03:29 PM   #8
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
aaaah, you've come to my rescue again

after running your script (that works perfectly btw) I then execute

Code:
 
cat d_block_* > one_big_padded_file.dat
split -d -a 3 -l 210 one_big_padded_file.dat
and I get out 380 files named

Code:
 x000 x001 x002... x379
perfect, and each of them has 210 rows by 20 columns of data per file, double perfect :)

the last thing I have to do is rename the files. I think I can do this with move and a for loop, but

the 1st group of ten files needs to be called: data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
and then the 2nd group of ten files needs to be called: data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat
etc.
all the way to the last group of ten files called: data_13.dat data_16.dat data_19.dat data_22.dat data_25.dat data_28.dat data_31.dat data_34.dat data_37.dat data_40.dat

so I know the only way to do this and not overwrite files is to put them in separate folders. so I have folders f1 f2 f3... f38 and then each of the folders gets a group of ten files moved into it.

so the two tasks I have left are

1) moving groups of 10 files into their proper folder
2) batch renaming the files with incrementing a part of the name


so that's what I'm going to work on, and I'll let you know when I get stuck

thanks so much, Tabby
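
The two remaining tasks can probably be handled in a single loop, since both the folder number and the number in the file name follow from a file's position in the sequence. A rough sketch, assuming the 380 files are named x000 ... x379 as listed above and sit in the current directory. The function name `sort_into_folders` is invented for this example.

```shell
#!/bin/sh
# sort_into_folders: hypothetical helper (not from the thread). Moves the
# files x000, x001, ... from the current directory into f1, f2, ..., ten
# files per folder, renaming each one to data_13.dat ... data_40.dat.
sort_into_folders() {
    n=0
    for f in x[0-9][0-9][0-9]; do
        folder=f$(( n / 10 + 1 ))         # files 0-9 -> f1, 10-19 -> f2, ...
        label=$(( 13 + 3 * (n % 10) ))    # 13, 16, 19, ..., 40 inside each folder
        mkdir -p "$folder"
        mv "$f" "$folder/data_$label.dat"
        n=$((n + 1))
    done
}
```

Because the three-digit names sort in numeric order, the glob hands the files to the loop in the same order `split` wrote them.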
 
Old 06-27-2013, 04:24 AM   #9
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 804

Rep: Reputation: 186
In awk (pseudo code)
In the BEGIN block, define an array of 6 x 20
In the regular block,
1. LET r = ((line_no - 1) mod 24) + 1, so that r runs from 1 to 24
2. If r = 1 then initialize the array, i.e. at the 1st, 25th, 49th etc. record
3. r < 7, then write $1 to $5 of the row to the array elements (r,1) to (r,5); break;
4. r < 13, then write $1 to $5 of the row to the array elements (r-6,6) to (r-6,10); break;
5. r < 19, then write $1 to $5 of the row to the array elements (r-12,11) to (r-12,15); break;
6. r < 25, then write $1 to $5 of the row to the array elements (r-18,16) to (r-18,20); break;
7. When the array is full, i.e. when r = 24, write the array.

This should do it.
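
One possible runnable reading of this pseudo code is sketched below. It is a simplification rather than a literal translation: instead of a 6 x 20 array of elements it keeps six strings, one per output row, and appends each 5-column input line to the right row. The function name `side_by_side` is hypothetical, and a trailing group of fewer than 24 lines is silently dropped.

```shell
#!/bin/sh
# side_by_side INPUT: hypothetical helper (not from the thread). Prints each
# run of 24 input rows (four 6-row blocks) as 6 rows of 20 columns, i.e. the
# same result as splitting into 6-line files and pasting four of them.
side_by_side() {
    awk '
    {
        r = (NR - 1) % 24        # position inside the current group, 0..23
        row = r % 6 + 1          # output row this line belongs to, 1..6
        # The first block of a group starts the row, later blocks are
        # appended with a tab, as paste would do.
        line[row] = (r < 6) ? $0 : (line[row] "\t" $0)
    }
    NR % 24 == 0 {               # group complete: emit the six assembled rows
        for (k = 1; k <= 6; k++) print line[k]
    }' "$1"
}
```

Usage would be `side_by_side file.txt > sixty_by_twenty.txt`, which avoids creating the 3800 intermediate files.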

OK
 
Old 06-27-2013, 08:38 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,245

Rep: Reputation: 2684
I am with AnanthaP ... I would simply alter the original code to give you your final output. Just think of what a final file would look like and then reverse the process till you get back to the start.

I guess at the end of the day, it will boil down to how many times you have to do this and how often any of the parameters change, as to whether or not you should create a single script / program.
 
Old 06-27-2013, 12:10 PM   #11
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
yikes! guys, that looks way too scary for me. I have to take little steps and add them together, but I have made a little progress....

after running grail's awk script, then from the system prompt I can run

Code:
 cat d_block_* > one_giant_padded_file.txt
and I get what I want for my next step, but I can't find the right syntax to run it inside of grail's awk script? Is there a way to run the cat command inside the awk script?

I thought maybe
Code:
 
{
cat d_block_* > one_giant_padded_file.txt) 
}
but this has syntax errors about the redirect >

so I tried using the system command
Code:
 
{
system("cat" d_block_* > giant_padded_file.txt) 
}
I still get syntax errors on the redirect > I'm not sure how to fix???

so I could keep working on this, I made the giant_padded_file.txt

and to keep working in the awk world, I wrote
Code:
awk '{print > ("file_" i ".dat")} !(NR%210) {close("file_" i ".dat"); i++}' i=1 giant_padded_file.txt
which does do the next step in the script

so now I'd like to move every group of 10 files into its own directory. I know there will be 38 directories so I could create them beforehand or have the script do it. Moving the files is what I'll work on next...

like cat, could mv also be done as a system command, that might be easier for moving the 10 files at a time???
 
Old 06-27-2013, 12:49 PM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,245

Rep: Reputation: 2684
If you are going to use cat then do it separately. You can call commands from inside awk, but this would not serve much purpose, as the next step would then be another shell command.
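
For completeness, the two attempts in the previous post fail because awk parses an unquoted `d_block_*` and `>` as awk syntax. With `system()`, the whole shell command, redirect included, has to be passed as one quoted string, and it belongs in an END block so it runs once instead of once per input line. The sketch below is a cut-down illustration without the zero padding. The function name `split_and_join` is made up, and the file names follow the ones used in the thread.

```shell
#!/bin/sh
# split_and_join INPUT: hypothetical helper (not from the thread). Writes
# every 6 input lines to its own d_block_NNNN.dat, then uses system() inside
# awk to cat the pieces back together.
split_and_join() {
    awk '
    {
        out = sprintf("d_block_%04d.dat", int((NR - 1) / 6))
        print > out
    }
    NR % 6 == 0 { close(out) }
    END {
        close(out)   # flush the last piece before the shell reads it
        # The complete shell command is ONE string.
        system("cat d_block_*.dat > giant_padded_file.txt")
    }' "$1"
}
```

As noted above, running `cat` as a separate step after the awk script is simpler and does the same thing.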
 
Old 06-27-2013, 01:03 PM   #13
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
Hi grail,

I guess I should find a way in awk to cat the 380 files (produced by your script) using awk? and that would be better for when I add in the
Code:
awk '{print > ("file_" i ".dat")} !(NR%210) {close("file_" i ".dat"); i++}' i=1 giant_padded_file.txt
part of the script

yes?
 
Old 06-27-2013, 02:50 PM   #14
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
so now I know I can effectively "cat" the files in awk with

Code:
 awk '{print $0}' d_block_* > one_giant_padded_file.txt
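but I can't figure out how to add that line of awk to grail's script (I've been trying, with lots of goofy output).
Code:
#!/usr/bin/awk -f

BEGIN{ 
	end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0" 
	extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

	cnt = 0
	file_name = "d_block_"

	# four digits, so that 3800 file names still sort in numeric order for "cat d_block_*"
	end = sprintf("%04d.dat",cnt++)
}

# pad each data row on the right with 15 zero columns
{ print $0,end_zeroes > (file_name end) }

# after every 6th row: append 15 rows of zeros, close the file, move on to the next name
!(NR % 6){

	for(i = 0; i < 15; i++)
		print extra_zeroes > (file_name end)

	close(file_name end)
	end = sprintf("%04d.dat",cnt++)
}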

???
I need to learn how to add a line of awk to the end of grail's script so that later on I can add my own
Code:
awk '{print > ("file_" i ".dat")} !(NR%210) {close("file_" i ".dat"); i++}' i=1 giant_padded_file.txt
this step by step is probably the prettiest way to do it but I'm learning with little bites

Last edited by atjurhs; 06-27-2013 at 02:54 PM.
 
Old 06-27-2013, 08:00 PM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,245

Rep: Reputation: 2684
hmmm ... need to step back a bit. The extra you have done is an entirely new awk script and cannot be simply attached to the previous.

If I understand correctly, you are simply grabbing all the output files created by my awk and then catting them into a new single file .... yes?

If yes, then the original script is now easier. Instead of outputting to multiple files, remove the redirect from inside the script and simply output all data to your new big file.
Code:
#!/usr/bin/awk -f

BEGIN{ 
	end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0" 
	extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
}

{ print $0,end_zeroes }

!(NR % 6){

	for(i = 0; i < 15; i++)
		print extra_zeroes
}
And call it like:
Code:
./script.awk 22800_by_5_matrix.txt > one_giant_padded_file.txt
However, as you are now looking at the next part, which is to once again split the data every 210 lines, then this can now be added to the above to change file names every 210 ... I'll let you work out how
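
One way that last step might be worked out is to count output rows instead of input rows and switch to a new file name whenever the count reaches a multiple of 210. The sketch below wraps the padding script in a shell function for convenience. The names `pad_and_split`, `emit`, and the `file_NNN.dat` naming pattern are assumptions made for illustration.

```shell
#!/bin/sh
# pad_and_split INPUT: hypothetical helper (not from the thread). Pads every
# 6-row block of INPUT to 21 rows by 20 columns and starts a new output file
# (file_001.dat, file_002.dat, ...) every 210 output rows.
pad_and_split() {
    awk '
    # Write one output row, switching files every 210 rows.
    function emit(text) {
        if (rows % 210 == 0) {
            if (out != "") close(out)
            out = sprintf("file_%03d.dat", ++n)
        }
        print text > out
        rows++
    }
    BEGIN {
        end_zeroes   = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    }
    { emit($0 " " end_zeroes) }                               # data row plus 15 zero columns
    !(NR % 6) { for (i = 0; i < 15; i++) emit(extra_zeroes) } # 15 zero rows per block
    ' "$1"
}
```

Run as `pad_and_split 22800_by_5_matrix.txt`, this should leave 380 files of 210 rows each, with no intermediate d_block files or giant padded file.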
 
  

