LinuxQuestions.org

reformatting a giant matrix (https://www.linuxquestions.org/questions/linux-newbie-8/reformatting-a-giant-matrix-4175467385/)

atjurhs 07-01-2013 04:02 PM

Well, I played with it over the weekend and came up with more pieces (sorry guys, I'm really stumbling over how to do this in one awk script), then I put the pieces together in a bash script to run it all. Here are the pieces in the bash script:

Code:

#!/usr/bin/awk -f

# this is from grail

BEGIN{
        end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
}

{ print $0,end_zeroes }

!(NR % 6){

        for(i = 0; i < 15; i++)
                print extra_zeroes
}

then to break the one giant file into the right size files I use

Code:

split -d -a 3 -l 210 one_big_padded_file.dat
this gives file names like x000 x001 x002 etc. etc. which need to be renamed to what I need, so I use

Code:

for f in x* ; do mv "$f" "file_$f" ; done

for f in file_* ; do mv "$f" "$f.dat" ; done

Now all the output files sort of have the right names.

I'd rather do it with this awk command, and when I run it from the command line after I run grail's script, it all works
Code:

awk '!(NR%210) {i++;} {print > "file_"i".dat";}' i=1 giant_padded_file.txt
but I can't figure out how to run them together in one awk script.

I'm lost, Tabby

grail 07-01-2013 05:54 PM

Well my first point would be that there is no need for 2 for loops as you can append to the start and end of the variable as you have done in awk.
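
A minimal sketch of that single-loop rename, assuming the pieces from split are named x000, x001, and so on. The scratch directory `workdir` and the empty placeholder files are hypothetical, created only so the loop can be tried safely:

```shell
# Hypothetical scratch directory with empty stand-ins for split's output
workdir=$(mktemp -d)
touch "$workdir/x000" "$workdir/x001" "$workdir/x002"

# One loop: add the prefix and the suffix in the same mv
for f in "$workdir"/x*; do
    base=${f##*/}                        # strip the directory part, e.g. x000
    mv "$f" "$workdir/file_$base.dat"    # x000 becomes file_x000.dat
done

ls "$workdir"
```

Run in the directory that holds the real pieces, the loop body reduces to `mv "$f" "file_$f.dat"`.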

To help with putting the awks together, look at the original awk and you will see how the file name was being created.
The only difference now is that instead of changing the name every 6 rows (NR % 6) you are now going to change it at a different point.

The gotcha is, it will not be changing every 210 rows as the new file created with second awk script (giant_padded_file.txt) has had additions.
The math is fairly trivial though.

Let me know how you get on?

atjurhs 07-02-2013 12:27 PM

...I have to take another step back

so say the input file has 240 lines
I want to break up the 240 lines every 6 lines so now I have 40 blocks of data
I want to add to each block 15 columns of zeros and 15 lines of zeros
now I have 21 lines per block, and 40 blocks of data so I get an output file with 840 lines which I've been calling the "one_giant_padded_file.txt"

as grail wrote the awk script that does it, it works perfectly, many many thanks!
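
The counts above can be double-checked with a little shell arithmetic. This is only a sketch: the variable names are made up for illustration, and the 210 lines per output file is taken from the split command earlier in the thread.

```shell
input_lines=240       # lines in the original file
rows_per_block=6      # data rows per block
pad_rows=15           # rows of zeros added after each block
lines_per_file=210    # lines wanted in each output file

blocks=$(( input_lines / rows_per_block ))               # 40 blocks
padded_block=$(( rows_per_block + pad_rows ))            # 21 lines per padded block
padded_total=$(( blocks * padded_block ))                # 840 lines in the padded file
files=$(( padded_total / lines_per_file ))               # 4 output files
blocks_per_file=$(( lines_per_file / padded_block ))     # 10 blocks per output file
input_per_file=$(( blocks_per_file * rows_per_block ))   # 60 original lines per output file

echo "$blocks blocks of $padded_block lines = $padded_total padded lines"
echo "$files files, one new file every $input_per_file input lines"
```

The last figure matters when the padding and the splitting are done in one pass: the output file has to change every 60 lines of the original input, not every 210.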

now for my awk line command (which kinda works right sorta). I take the one giant file with 840 lines and run this line command.

Code:

awk '!(NR%210) {i++;} {print > "file_"i".dat";}' i=1 one_giant_padded_file.txt
but I get wrong results, I get 5 files, and...

Code:

file_1.dat has 209 lines and the file is missing its last line
file_2.dat has 210 lines
file_3.dat has 210 lines
file_4.dat has 210 lines
file_5.dat has 1 lines and the one line is all zeros

it looks like file_5.dat is what's supposed to go as the last line of file_1.dat. Anyway, the last file should not exist, and file_1.dat should have 210 lines with the last line all zeros. Is there supposed to be a stop somewhere in the command so it doesn't loop back around? IDK.

grail, can you please help me, I've been trying to fix this since yesterday afternoon

thank you, Tabby

grail 07-02-2013 01:16 PM

You need to think about order of execution.
Code:

!(NR%210) {i++;}
This says, when you reach the 210th line of the current file, increase the counter by 1 ... so where do you think the 210th line (first round) will go??

Once you have this ... you can then simply add this into the original script ;)
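
The effect of rule order is easier to see on a scaled-down case. The sketch below is an illustration only: it assumes 8 input lines and a new file every 4 lines (instead of 840 and 210), writes into a hypothetical scratch directory `demo`, and parenthesizes the redirection target for portability across awk implementations.

```shell
demo=$(mktemp -d)
seq 1 8 > "$demo/input.txt"

# Same rule order as the one-liner in question: increment first, print second
awk -v dir="$demo" '
    !(NR % 4) { i++ }                        # fires on lines 4 and 8, before they are printed
    { print > (dir "/file_" i ".dat") }      # so lines 4 and 8 land in the NEXT file
' i=1 "$demo/input.txt"

# file_1.dat gets 3 lines, file_2.dat gets 4, file_3.dat gets 1
wc -l "$demo"/file_*.dat
```

This is the same pattern as the 209 / 210 / ... / 1 result reported above: the line that should close each file is written after the counter has already moved on.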

PTrenholme 07-02-2013 02:41 PM

grail's approach to answering questions is to provide a fish hook, pole, and line; I prefer to offer a little advice about how to use the fishing equipment and something about where the fish live. . .

So, a suggestion: See if your system responds to pinfo gawk or the older info gawk.

Here's an UNTESTED modification of grail's program, with some added comments.

Code:

#!/usr/bin/gawk -f
# This section is run once, before any input is processed
BEGIN {
        end_zeroes =  "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
# Initial output file name
        output_file=sprintf("file_%d.dat",++output_file_count)
# Number of lines written to this output file
        nout=0
}

#################################################
#
# These blocks (i.e., 'test {statements}')
# are executed, in the order in which they appear,
# for each line read from any input file.
#
# ("file names" in the form "x=y"
# set the value of x to y and are processed
# when read AFTER the BEGIN section (if any) is
# run.)
#
##################################################
#
# Do we now have 210 lines in the output file?
(nout == 210) {
# Get the next output file name
        output_file=sprintf("file_%d.dat",++output_file_count)
# And reset the output line count to zero
        nout=0
}

# Copy the next input file line to the current output file and increment the output line count
{       print $0, end_zeroes > output_file
        ++nout
}

# If the number of records read is a multiple of 6, add 15 lines of zeros
# and increment the output line count by 15
!(NR % 6) {
        for(i = 0; i < 15; i++) {
            print extra_zeroes > output_file
        }
        nout += 15
}
#############################################
# This block is run after the last input file
# is read.
#############################################
#
# Write some summary info to the console
END {
  print "Done. Wrote " output_file_count " files."
}

Note: The first line (starting with the "shebang," #!) would be used by a Linux system if you saved the code to a file and made it executable (chmod u+x code_file_name) so you could run it as a command (e.g., $ ./code_file_name input_file(s)).

atjurhs 07-02-2013 03:28 PM

grail

I got it, I got it, wohoooo :) :) :)

Code:

awk 'NR%210==1 {"file_"i".dat";i++;} {print > "file_"i".dat"}' i=0 giant_input.file
that was hard, at least for me it was.

if there is something that I should change to stop any errors/bugs that I don't know about please tell me


PTrenholme, I'll give yours a look over too...

excited/happy Tabby

atjurhs 07-02-2013 07:50 PM

well guys I think that does it. thanks sooooo much for all your help!

Tabby

grail 07-03-2013 03:59 AM

I agree with PTrenholme's analogy that I provide direction as opposed to answers, but generally only to those that seem to be following :)

Glad you found a solution. Now that you have one, here is what I would look at:

1. Your final solution works, which is cool, but what I was pointing at in my last advice was that by simply changing the position of the increment you would achieve the same effect:
Code:

awk '{print > "file_"i".dat"}!(NR%210) {i++}' i=1 one_giant_padded_file.txt
2. As I pointed out, this can then be added to the original script to output the data from the original file:
Code:

#!/usr/bin/awk -f

BEGIN{
        end_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
        extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

        cnt = 0

        file_name = sprintf("file_%02d.dat",++cnt)
}

{ print $0,end_zeroes > file_name }

!(NR % 6){

        for(i = 0; i < 15; i++)
                print extra_zeroes > file_name
}

!(NR%60){ file_name = sprintf("file_%02d.dat",++cnt) }
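
A quick way to check the combined script is to feed it generated data and count what comes out. This is a hedged sketch, not part of the original post: the 240 rows of 5 columns are made-up test data (5 columns is inferred from the 15 and 20 zeros in the padding strings), the scratch directory `out` is hypothetical, and a close() call is added so only one output file is open at a time.

```shell
out=$(mktemp -d)

# 240 hypothetical input rows, 5 columns each
awk 'BEGIN { for (r = 1; r <= 240; r++) print r, r, r, r, r }' > "$out/input.txt"

awk -v dir="$out" '
BEGIN {
    end_zeroes   = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    extra_zeroes = "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    file_name = sprintf("%s/file_%02d.dat", dir, ++cnt)
}
{ print $0, end_zeroes > file_name }                          # pad each row with 15 columns
!(NR % 6)  { for (i = 0; i < 15; i++) print extra_zeroes > file_name }   # 15 rows of zeros per block
!(NR % 60) { close(file_name)                                 # 60 input rows = 210 output rows
             file_name = sprintf("%s/file_%02d.dat", dir, ++cnt) }
' "$out/input.txt"

wc -l "$out"/file_*.dat
```

With this test input the run leaves four files of 210 lines each. A fifth name is generated after the last row, but nothing is ever printed to it, so no fifth file appears.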


atjurhs 07-03-2013 09:07 AM

good morning Grail, you get up too early, even I'm up too early today :)

I did try to follow your direction of moving the iterator, but...

Code:

# warning: these 2 awk commands do not work

awk '!(NR%210) {print > "file_"i".dat";} {i++;} ' i=1 one_giant_padded_file.txt

# and I tried putting it inside the print statement

awk '!(NR%210) {print > "file_"i".dat" i++;} ' i=1 one_giant_padded_file.txt

you know (and I found out) that this keeps getting syntax errors that I couldn't work around.

what I couldn't think past was not having the separating statement
Code:

!(NR%210)
at the very beginning of the command, and even in my awk command that works I had the separating statement at the beginning.

the "combined script" is definitely beyond my coding ability. I've never had a class in any sort of programming, so I'm learning awk, sed, and bash writing on my own because a lot of what I do is re-formatting and re-configuring files and directories to run in already existing programs. I do write pseudo code to organize my thoughts, but putting it into real code is much tougher for me. Like, I can understand what for and while loops do, but I'll pull my hair out trying to write one, and having 3 print statements in that combined script, no way would I have got that. So I'll keep working at it, and thank you so much for your help!

Thanks so much, Tabby

please read my PM to you, ahhhh, I haven't figured out how to do that, is there a link somewhere?

tabbyagirl 07-03-2013 01:27 PM

I found out I can't send a PM, so out in the open: the people I work with asked me to change my username here, so I did. My new username here is "tabbyagirl"

grail 07-04-2013 03:54 AM

Well I would need more information on any error messages to help with them.

Looking at the 2 lines you have in post #24, neither would work well for what you want, but I shall try to explain:
Code:

awk '!(NR%210) {print > "file_"i".dat";} {i++;} ' i=1 one_giant_padded_file.txt
There are 2 issues here:

1. The 'i' variable is now going to increase for every line read in the file, ie by the end of the script it will be 841

2. As you now have the condition '!(NR%210)' prior to your print command, it will only print every 210th line, ie only 4 single lines, one per file will be printed
Code:

awk '!(NR%210) {print > "file_"i".dat" i++;} ' i=1 one_giant_padded_file.txt
Here you have the same issue as above for printing, but now the value for 'i' will only get to 4

If you look at my example:
Code:

awk '{print > "file_"i".dat"}!(NR%210) {i++}' i=1 one_giant_padded_file.txt
{print > "file_"i".dat"} - This will print every line of one_giant_padded_file.txt into a new file called 'file_N.dat' where N starts at 1

!(NR%210) {i++} - This tells awk that when NR is evenly divisible by 210 that the variable 'i' will be increased by 1, hence our file of 840 lines will force the variable to be increased 4 times

Note: Even though 'i' is increased 4 times, the last value of 'i' is 5 but it is never used
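
The same behaviour can be watched on a scaled-down run. In this sketch (an illustration only) there are 8 input lines and a new file every 4 lines instead of 840 and 210, the scratch directory `demo_fix` is hypothetical, and the redirection target is parenthesized for portability.

```shell
demo_fix=$(mktemp -d)
seq 1 8 > "$demo_fix/input.txt"

# Print first, increment second; END reports the final counter value
final_i=$(awk -v dir="$demo_fix" '
    { print > (dir "/file_" i ".dat") }      # every line goes to the current file
    !(NR % 4) { i++ }                        # after lines 4 and 8, move to the next name
    END { print i }
' i=1 "$demo_fix/input.txt")

wc -l "$demo_fix"/file_*.dat                 # file_1.dat and file_2.dat, 4 lines each
echo "final i = $final_i"                    # 3, but file_3.dat is never created
```

The counter ends one past the last file actually written, which mirrors the note above about the final value of 'i' never being used.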


Lastly, instead of comparing the new script from post #23 to the previous version in post #15, compare it instead to the one in post #7 as apart from a slight change in the BEGIN section
the following is the only new line:
Code:

!(NR%60){ file_name = sprintf("file_%02d.dat",++cnt) }
Hope some of this helps

tabbyagirl 07-05-2013 11:33 AM

that helps VERY much, in learning what's going on as it steps through the lines of code.

a friend of mine has some C code development tool that lets him step through each line so he can see what's happening, kinda like you explained up above. Do they have such a thing for scripting languages? converting over to C seems like a BIG step, IDK

Tabby

grail 07-05-2013 12:36 PM

When scripting in bash, you can put the following on the second line of the script to turn on logging of sorts:
Code:

set -xv
As for awk, or really any language whilst learning, I believe your best friend is the standard print / echo statement. Simply redirect all variables each time they change
to a separate file (or on screen if only a few lines) and then you can track down where things have gone wrong.

Other options like the one above or something like gdb to step through C code can be adopted later when executing much larger programs / scripts :)
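
As a small illustration of the print-statement approach in awk, the sketch below prints where each line would go instead of actually writing files. It is scaled down to 8 lines and a switch every 4 lines (an assumption for readability), and the counter is set with -v so the program can read from a pipe.

```shell
trace=$(seq 1 8 | awk -v i=1 '
    { printf "NR=%d goes to file_%d.dat\n", NR, i }   # report the target before anything changes
    !(NR % 4) { i++ }                                   # switch names after every 4th line
')

printf '%s\n' "$trace"
```

Reading the trace top to bottom shows lines 1 to 4 assigned to file_1.dat and lines 5 to 8 to file_2.dat. Swapping the two rules and rerunning makes the off-by-one visible immediately.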

PTrenholme 07-09-2013 09:14 AM

There is also a "full-fledged" gawk debugger available. It's described in the gawk info file to which I referred you above.

Basically, instead of, for example, gawk '{print > "file_"i".dat"}!(NR%210) {i++}' i=1 one_giant_padded_file.txt you would use dgawk '{print > "file_"i".dat"}!(NR%210) {i++}' i=1 one_giant_padded_file.txt

If your 'C' friend is familiar with gdb usage, dgawk commands are similar to those. The info section on "debugging" describes the usage fairly well.

A comment that, hopefully, will help you understand where you're losing track:

In a "one-line" command like awk '{print > "file_"i".dat"}!(NR%210) {i++}' i=1 one_giant_padded_file.txt, the "stuff" between the single quotes is a gawk program and the rest of the line is the arguments for that program. Instead of that "one-line" program, you could have done this:
Code:

$ cat > tmp.gawk
# Do this for every input line (I.e., No condition precedes the expression.)
{
  print > "file_" i ".dat"
}
# Do this whenever the number of records read is a multiple of 210
# (I.e., when the remainder of (NR / 210) is zero)
!(NR%210) {
  i += 1
}
^D
$ gawk -f tmp.gawk i=1 one_giant_padded_file.txt

Note that comments and extra spaces are ignored in gawk code, and that, generally, newline characters and semicolons are equivalent: one or the other is required to separate program statements inside an action. (There are a few exceptions; for example, a quoted string may contain almost any character.)
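
To see that the one-line form and the file form really are the same program, both can be run on the same input and compared. This is a sketch under stated assumptions: the scratch directory `tmp`, the file name `split.awk`, and the scaled-down 8-line input are made up for the test, and the program prints to the screen rather than to files so the two outputs can be compared as strings.

```shell
tmp=$(mktemp -d)

# The program saved to a file, one rule per line, with a comment
cat > "$tmp/split.awk" <<'EOF'
# Print every line number with the current counter, then bump the counter on every 4th line
{ print NR, i }
!(NR % 4) { i += 1 }
EOF

one_line=$(seq 1 8 | awk -v i=1 '{ print NR, i } !(NR % 4) { i += 1 }')
from_file=$(seq 1 8 | awk -v i=1 -f "$tmp/split.awk")

if [ "$one_line" = "$from_file" ]; then
    echo "same output"
fi
```

Using a here-document (the `<<'EOF'` part) also avoids typing the program interactively into cat.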


All times are GMT -5.