LinuxQuestions.org - awk - rearrange column data in csv file to match columns

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - awk - rearrange column data in csv file to match columns (https://www.linuxquestions.org/questions/programming-9/awk-rearrange-column-data-in-csv-file-to-match-columns-919590/)

awk - rearrange column data in csv file to match columns

Hi,

I have a csv file that I created from two different device lists. One lists the devices in our trouble ticketing system, the other lists devices in our network monitoring application. In some cases, the device is in the ticketing system but not the network monitoring software, in some cases the device is in our network monitoring software, but not our ticketing system, and sometimes it is in both. I put the data in one csv file and I am trying to line up all of the records by IP address. The desired format when viewed with Open Office Calc (or Excel) is as follows (I changed the actual data to make it generic, but I can provide a snippet of the actual file if needed):

Code:



TS= Ticketing System

MS= Monitoring System

In backup | NodeID-TS | Name-TS | In TS    |  In TS but not MS | In Both  |  In MS but not TS | In MS      | Name-TS  | NodeID-MS



Yes        11111      node1    10.10.10.1                      10.10.10.1                    10.10.10.1  node_1    11111

No          22222      node2    10.10.10.2                      10.10.10.2                    10.10.10.2  node_2    22222

Yes        33333      node3    10.10.10.3    10.10.10.3                                                    

                                                                              10.10.10.4        10.10.10.4  node_4    44444

Yes        55555      node5    10.10.10.5                      10.10.10.5                    10.10.10.5  node_5    55555

                                                                              10.10.10.6        10.10.10.6  node_6    66666

No          77777      node7    10.10.10.7    10.10.10.7        

Yes        88888      node8    10.10.10.8    10.10.10.8        

                                                                              10.10.10.9        10.10.10.9  node_9    99999

However, currently the "In MS" column is not lined up with the rest of the document and it looks like this:

Code:



In backup | NodeID-TS | Name-TS | In TS    |  In TS but not MS | In Both  |  In MS but not TS | In MS      | Name-TS  | NodeID-MS



Yes        11111      node1    10.10.10.1                      10.10.10.1                    10.10.10.1  node_1    11111

No          22222      node2    10.10.10.2                      10.10.10.2                    10.10.10.2  node_2    22222

Yes        33333      node3    10.10.10.3    10.10.10.3                      10.10.10.4        10.10.10.4  node_4    44444

                                                                                                10.10.10.5  node_5    55555

Yes        55555      node5    10.10.10.5                      10.10.10.5  10.10.10.6        10.10.10.6  node_6    66666                  

                                                                              10.10.10.9        10.10.10.9  node_9    99999

No          77777      node7    10.10.10.7    10.10.10.7        

Yes        88888      node8    10.10.10.8    10.10.10.8

Here's what it looks like as a .csv file in a text editor, for sample purposes:

Code:



In backup,NodeID-TS,Name-TS,In TS,In TS but not MS,In Both,In MS but not TS,In MS,Name-TS,NodeID-MS

"Yes",11111,"node1","10.10.10.1",,"10.10.10.1",,"10.10.10.1","node_1",11111

"No",22222,"node2","10.10.10.2",,"10.10.10.2",,"10.10.10.2","node_2",22222

"Yes",33333,"node3","10.10.10.3","10.10.10.3",,"10.10.10.4","10.10.10.4","node_4",44444

,,,,,,,"10.10.10.5","node_5",55555

"Yes",55555,"node5","10.10.10.5",,"10.10.10.5","10.10.10.6","10.10.10.6","node_6",66666

,,,,,,"10.10.10.9","10.10.10.9","node_9",99999

"No",77777,"node7","10.10.10.7","10.10.10.7",,,,,

"Yes",88888,"node8","10.10.10.8","10.10.10.8",,,,,

Basically I need it so that if a node is in both systems, the line pretty much stays the same. If it is in only one system, that IP and associated record information needs to have its own line. The IP's need to be in order so the records are easier to work with. So far, I've tried this:

Code:



#!/bin/bash



cat mes-SHEET_copy.csv | while read line

 do



clmnA=$(echo $line | cut -d"," -f1)

clmnB=$(echo "$line" | cut -d"," -f2)

clmnC=$(echo "$line" | cut -d"," -f3)

clmnD=$(echo "$line" | cut -d"," -f4)

clmnE=$(echo "$line" | cut -d"," -f5)

clmnF=$(echo "$line" | cut -d"," -f6)

clmnG=$(echo "$line" | cut -d"," -f7)

clmnH=$(echo "$line" | cut -d"," -f8)

clmnI=$(echo "$line" | cut -d"," -f9)

clmnJ=$(echo "$line" | cut -d"," -f10)



 if [ $clmnE != $clmnF ]; then

 

 echo "$clmnA,$clmnB,$clmnC,$clmnD,$clmnE,$clmnF,,,,,"

 

 elif [ $clmnF != $clmnG ]; then

 

 echo ",,,,,,$clmnF,$clmnG,$clmnH,$clmnI,$clmnJ,"



 elif [ -n $clmnF ]; then

 

 echo "$clmnA,$clmnB,$clmnC,$clmnD,$clmnE,$clmnF,$clmnG,$clmnH,$clmnI,$clmnJ"



 fi



 done

and this:

Code:



#!/bin/bash



cat mes-SHEET_copy.csv | while read line

 do



clmnF=$(echo "$line" | awk -F"," '{print $6}')

clmnG=$(echo "$line" | awk -F"," '{print $7}')

clmnH=$(echo "$line" | awk -F"," '{print $8}')



 if [ $clmnF=$clmnH ]; then



 echo $line



 else echo $line | awk -F"," '{print $1,$2,$3,$4,$5,$6}'



fi



done

...but it has occurred to me that none of these will work, because it will not "save" the information that was already in columns 7, 8, and 9 and put it on the next line.

Anyone have any ideas? I really don't want to go through 7,000 lines manually. Thanks in advance.

Please include the header information (ie what does each field represent) for the csv as this will help work out how to arrive at your required output.

Hi.

Try the following awk script

Code:

BEGIN{

        FS=","

        OFS=","

        in_ts        = 4

        in_ms        = 8



        ts_only        = 5

        both    = 6

        ms_only        = 7



        max        = 10

}

$in_ts { ts[$in_ts]=$0 }

$in_ms { ms[$in_ms]=$0 }



END{

        for(l in ts){

                $0 = ts[l]

                $both = ""

                $ts_only = ""

                $ms_only = ""

                if(l in ms){

                        $both = l



                        split(ms[l], tmp)

                        for(i=in_ms; i<=max; i++)

                                $i = tmp[i]

                }

                else {

                              $ts_only = l

                        for(i=ts_only+1; i<=max; i++)

                                $i = ""

                }



                result[l] = $0

                delete ms[l]

        }



        # Now ms contain "ms_only" records

        for(l in ms) {

                $0 = ms[l]

                $ms_only = l

                for(i=1; i<ms_only; i++)

                        $i = ""

                result[l] = $0

        }



        for(l in result)

                print result[l]

}

Interesting parts here (relevant quotes from `man awk'):
1)

Quote:

Fields need not be referenced by constants:

n = 5
print $n

prints the fifth field in the input record.

Quote:

References to non-existent fields (i.e. fields after $NF) produce the null-string. However, assigning to a
non-existent field (e.g., $(NF+2) = 5) increases the value of NF, creates any intervening fields with the null
string as their value, and causes the value of $0 to be recomputed, with the fields being separated by the
value of OFS. References to negative numbered fields cause a fatal error. Decrementing NF causes the values
of fields past the new value to be lost, and the value of $0 to be recomputed, with the fields being separated
by the value of OFS.

Assigning a value to an existing field causes the whole record to be rebuilt when $0 is referenced. Similarly,
assigning a value to $0 causes the record to be resplit, creating new values for the fields.

You can use `asorti()' to sort `result' by IP address if you need to.

Let me know if you need further explanation.
Hope this helps.

Quote:

Originally Posted by firstfire (Post 4554263)

Hi.

Try the following awk script

Code:

BEGIN{

        FS=","

        OFS=","

        in_ts        = 4

        in_ms        = 8



        ts_only        = 5

        both    = 6

        ms_only        = 7



        max        = 10

}

$in_ts { ts[$in_ts]=$0 }

$in_ms { ms[$in_ms]=$0 }



END{

        for(l in ts){

                $0 = ts[l]

                $both = ""

                $ts_only = ""

                $ms_only = ""

                if(l in ms){

                        $both = l



                        split(ms[l], tmp)

                        for(i=in_ms; i<=max; i++)

                                $i = tmp[i]

                }

                else {

                              $ts_only = l

                        for(i=ts_only+1; i<=max; i++)

                                $i = ""

                }



                result[l] = $0

                delete ms[l]

        }



        # Now ms contain "ms_only" records

        for(l in ms) {

                $0 = ms[l]

                $ms_only = l

                for(i=1; i<ms_only; i++)

                        $i = ""

                result[l] = $0

        }



        for(l in result)

                print result[l]

}

Interesting parts here (relevant quotes from `man awk'):
1)

2)

You can use `asorti()' to sort `result' by IP address if you need to.

Let me know if you need further explanation.
Hope this helps.

One thing -- how do I execute this? is it ./script < file.csv ?

Quote:

Originally Posted by grail (Post 4554260)

Please include the header information (ie what does each field represent) for the csv as this will help work out how to arrive at your required output.

grail - I have the headers in the "excel-like" examples, but forgot to put them in the csv-style example. I added them just now. Thanks.

Quote:

Originally Posted by wolverene13 (Post 4554644)

One thing -- how do I execute this? is it ./script < file.csv ?

Good question! :)

Code:

awk -f script.awk < file.csv > out.csv

"<" is optional.

P.S. to skip the header add the line (in bold)

Code:

...

        max    = 10

}

FNR == 1 {print; next}

$in_ts { ts[$in_ts]=$0 }

...

Quote:

Originally Posted by firstfire (Post 4554651)

Good question! :)

Code:

awk -f script.awk < file.csv > out.csv

"<" is optional.

P.S. to skip the header add the line (in bold)

Code:

...

        max    = 10

}

FNR == 1 {print; next}

$in_ts { ts[$in_ts]=$0 }

...

I'm getting a syntax error near line 13 for some reason:

Code:

awk -f awk_test_script.sh < mes-SHEET.csv > mes-out.csv

awk: syntax error near line 13

awk: bailing out near line 13

Quote:

Originally Posted by wolverene13 (Post 4554669)

I'm getting a syntax error near line 13 for some reason:

Code:

awk -f awk_test_script.sh < mes-SHEET.csv > mes-out.csv

awk: syntax error near line 13

awk: bailing out near line 13

It looks like your awk is old (according to this link). Try using nawk or gawk instead. On my machine script works fine

Code:

$ awk -f rearrange.awk test.csv

In backup,NodeID-TS,Name-TS,In TS,In TS but not MS,In Both,In MS but not TS,In MS,Name-TS,NodeID-MS

"Yes",11111,"node1","10.10.10.1",,"10.10.10.1",,"10.10.10.1","node_1",11111

"No",22222,"node2","10.10.10.2",,"10.10.10.2",,"10.10.10.2","node_2",22222

"Yes",33333,"node3","10.10.10.3","10.10.10.3",,,,,

,,,,,,"10.10.10.4","10.10.10.4","node_4",44444

"Yes",55555,"node5","10.10.10.5",,"10.10.10.5",,"10.10.10.5","node_5",55555

,,,,,,"10.10.10.6","10.10.10.6","node_6",66666

"No",77777,"node7","10.10.10.7","10.10.10.7",,,,,

"Yes",88888,"node8","10.10.10.8","10.10.10.8",,,,,

,,,,,,"10.10.10.9","10.10.10.9","node_9",99999

My awk vesion is GNU Awk 3.1.8.

Here is another direction for the same solution:

Code:

#!/usr/bin/awk -f



BEGIN{

        FS = ","

        front = "%-10s%-10s%-10s%-15s%-17s"

        back = "%-17s%-15s%-10s%-10s\n"

        format = front"%-15s"back

}



{  gsub(/"/,"")    }



NR == 1 || $2 == $10{ printf format,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10;next }



{

    printf front,$1,$2,$3,$4,$5



    if( end[x] && ( ! $2 || $2 == x ) ){

        mid = " "

        if( $2 == x )

                mid = $4



        printf "%-15s%s",mid,end[x]

        delete end

    }

    else

        printf "\n"



    if( ! end[x] ){

        end[$10] = sprintf(back,$7,$8,$9,$10)

        x = $10

    }

}



END{

    if( end[x] )

        printf "%-77s%s"," ",end[x]

}

Simply run as:

Code:

./script.awk file.csv

Quote:

Originally Posted by firstfire (Post 4554956)

It looks like your awk is old (according to this link). Try using nawk or gawk instead. On my machine script works fine

Code:

$ awk -f rearrange.awk test.csv

In backup,NodeID-TS,Name-TS,In TS,In TS but not MS,In Both,In MS but not TS,In MS,Name-TS,NodeID-MS

"Yes",11111,"node1","10.10.10.1",,"10.10.10.1",,"10.10.10.1","node_1",11111

"No",22222,"node2","10.10.10.2",,"10.10.10.2",,"10.10.10.2","node_2",22222

"Yes",33333,"node3","10.10.10.3","10.10.10.3",,,,,

,,,,,,"10.10.10.4","10.10.10.4","node_4",44444

"Yes",55555,"node5","10.10.10.5",,"10.10.10.5",,"10.10.10.5","node_5",55555

,,,,,,"10.10.10.6","10.10.10.6","node_6",66666

"No",77777,"node7","10.10.10.7","10.10.10.7",,,,,

"Yes",88888,"node8","10.10.10.8","10.10.10.8",,,,,

,,,,,,"10.10.10.9","10.10.10.9","node_9",99999

My awk vesion is GNU Awk 3.1.8.

Yes, that's the problem -- at work we are using Solaris, and it doesn't work there, but it looks like it works on my Debian machine at home without a problem. I'll run it for real and let you know if it worked.