LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   awk - rearrange column data in csv file to match columns (https://www.linuxquestions.org/questions/programming-9/awk-rearrange-column-data-in-csv-file-to-match-columns-919590/)

wolverene13 12-19-2011 09:23 PM

awk - rearrange column data in csv file to match columns
 
Hi,

I have a csv file that I created from two different device lists. One lists the devices in our trouble ticketing system, the other lists devices in our network monitoring application. In some cases, the device is in the ticketing system but not the network monitoring software, in some cases the device is in our network monitoring software, but not our ticketing system, and sometimes it is in both. I put the data in one csv file and I am trying to line up all of the records by IP address. The desired format when viewed with Open Office Calc (or Excel) is as follows (I changed the actual data to make it generic, but I can provide a snippet of the actual file if needed):

Code:


TS= Ticketing System
MS= Monitoring System
In backup | NodeID-TS | Name-TS | In TS    |  In TS but not MS | In Both  |  In MS but not TS | In MS      | Name-TS  | NodeID-MS

Yes        11111      node1    10.10.10.1                      10.10.10.1                    10.10.10.1  node_1    11111
No          22222      node2    10.10.10.2                      10.10.10.2                    10.10.10.2  node_2    22222
Yes        33333      node3    10.10.10.3    10.10.10.3                                                   
                                                                              10.10.10.4        10.10.10.4  node_4    44444
Yes        55555      node5    10.10.10.5                      10.10.10.5                    10.10.10.5  node_5    55555
                                                                              10.10.10.6        10.10.10.6  node_6    66666
No          77777      node7    10.10.10.7    10.10.10.7       
Yes        88888      node8    10.10.10.8    10.10.10.8       
                                                                              10.10.10.9        10.10.10.9  node_9    99999

However, currently the "In MS" column is not lined up with the rest of the document and it looks like this:

Code:


In backup | NodeID-TS | Name-TS | In TS    |  In TS but not MS | In Both  |  In MS but not TS | In MS      | Name-TS  | NodeID-MS

Yes        11111      node1    10.10.10.1                      10.10.10.1                    10.10.10.1  node_1    11111
No          22222      node2    10.10.10.2                      10.10.10.2                    10.10.10.2  node_2    22222
Yes        33333      node3    10.10.10.3    10.10.10.3                      10.10.10.4        10.10.10.4  node_4    44444
                                                                                                10.10.10.5  node_5    55555
Yes        55555      node5    10.10.10.5                      10.10.10.5  10.10.10.6        10.10.10.6  node_6    66666                 
                                                                              10.10.10.9        10.10.10.9  node_9    99999
No          77777      node7    10.10.10.7    10.10.10.7       
Yes        88888      node8    10.10.10.8    10.10.10.8

Here's what it looks like as a .csv file in a text editor, for sample purposes:

Code:


In backup,NodeID-TS,Name-TS,In TS,In TS but not MS,In Both,In MS but not TS,In MS,Name-TS,NodeID-MS
"Yes",11111,"node1","10.10.10.1",,"10.10.10.1",,"10.10.10.1","node_1",11111
"No",22222,"node2","10.10.10.2",,"10.10.10.2",,"10.10.10.2","node_2",22222
"Yes",33333,"node3","10.10.10.3","10.10.10.3",,"10.10.10.4","10.10.10.4","node_4",44444
,,,,,,,"10.10.10.5","node_5",55555
"Yes",55555,"node5","10.10.10.5",,"10.10.10.5","10.10.10.6","10.10.10.6","node_6",66666
,,,,,,"10.10.10.9","10.10.10.9","node_9",99999
"No",77777,"node7","10.10.10.7","10.10.10.7",,,,,
"Yes",88888,"node8","10.10.10.8","10.10.10.8",,,,,

Basically I need it so that if a node is in both systems, the line pretty much stays the same. If it is in only one system, that IP and associated record information needs to have its own line. The IP's need to be in order so the records are easier to work with. So far, I've tried this:

Code:


#!/bin/bash

cat mes-SHEET_copy.csv | while read line
 do

clmnA=$(echo $line | cut -d"," -f1)
clmnB=$(echo "$line" | cut -d"," -f2)
clmnC=$(echo "$line" | cut -d"," -f3)
clmnD=$(echo "$line" | cut -d"," -f4)
clmnE=$(echo "$line" | cut -d"," -f5)
clmnF=$(echo "$line" | cut -d"," -f6)
clmnG=$(echo "$line" | cut -d"," -f7)
clmnH=$(echo "$line" | cut -d"," -f8)
clmnI=$(echo "$line" | cut -d"," -f9)
clmnJ=$(echo "$line" | cut -d"," -f10)

 if [ $clmnE != $clmnF ]; then
 
 echo "$clmnA,$clmnB,$clmnC,$clmnD,$clmnE,$clmnF,,,,,"
 
 elif [ $clmnF != $clmnG ]; then
 
 echo ",,,,,,$clmnF,$clmnG,$clmnH,$clmnI,$clmnJ,"

 elif [ -n $clmnF ]; then
 
 echo "$clmnA,$clmnB,$clmnC,$clmnD,$clmnE,$clmnF,$clmnG,$clmnH,$clmnI,$clmnJ"

 fi

 done

and this:

Code:


#!/bin/bash

cat mes-SHEET_copy.csv | while read line
 do

clmnF=$(echo "$line" | awk -F"," '{print $6}')
clmnG=$(echo "$line" | awk -F"," '{print $7}')
clmnH=$(echo "$line" | awk -F"," '{print $8}')

 if [ $clmnF=$clmnH ]; then

 echo $line

 else echo $line | awk -F"," '{print $1,$2,$3,$4,$5,$6}'

fi

done

...but it has occurred to me that none of these will work, because it will not "save" the information that was already in columns 7, 8, and 9 and put it on the next line.

Anyone have any ideas? I really don't want to go through 7,000 lines manually. Thanks in advance.

grail 12-20-2011 01:44 AM

Please include the header information (ie what does each field represent) for the csv as this will help work out how to arrive at your required output.

firstfire 12-20-2011 01:53 AM

Hi.

Try the following awk script
Code:

BEGIN{
        FS=","
        OFS=","
        in_ts        = 4
        in_ms        = 8

        ts_only        = 5
        both    = 6
        ms_only        = 7

        max        = 10
}
$in_ts { ts[$in_ts]=$0 }
$in_ms { ms[$in_ms]=$0 }

END{
        for(l in ts){
                $0 = ts[l]
                $both = ""
                $ts_only = ""
                $ms_only = ""
                if(l in ms){
                        $both = l

                        split(ms[l], tmp)
                        for(i=in_ms; i<=max; i++)
                                $i = tmp[i]
                }
                else {
                              $ts_only = l
                        for(i=ts_only+1; i<=max; i++)
                                $i = ""
                }

                result[l] = $0
                delete ms[l]
        }

        # Now ms contain "ms_only" records
        for(l in ms) {
                $0 = ms[l]
                $ms_only = l
                for(i=1; i<ms_only; i++)
                        $i = ""
                result[l] = $0
        }

        for(l in result)
                print result[l]
}

Interesting parts here (relevant quotes from `man awk'):
1)
Quote:

Fields need not be referenced by constants:

n = 5
print $n

prints the fifth field in the input record.
2)
Quote:

References to non-existent fields (i.e. fields after $NF) produce the null-string. However, assigning to a
non-existent field (e.g., $(NF+2) = 5) increases the value of NF, creates any intervening fields with the null
string as their value, and causes the value of $0 to be recomputed, with the fields being separated by the
value of OFS. References to negative numbered fields cause a fatal error. Decrementing NF causes the values
of fields past the new value to be lost, and the value of $0 to be recomputed, with the fields being separated
by the value of OFS.

Assigning a value to an existing field causes the whole record to be rebuilt when $0 is referenced. Similarly,
assigning a value to $0 causes the record to be resplit, creating new values for the fields.
You can use `asorti()' to sort `result' by IP address if you need to.

Let me know if you need further explanation.
Hope this helps.

wolverene13 12-20-2011 01:16 PM

Quote:

Originally Posted by firstfire (Post 4554263)
Hi.

Try the following awk script
Code:

BEGIN{
        FS=","
        OFS=","
        in_ts        = 4
        in_ms        = 8

        ts_only        = 5
        both    = 6
        ms_only        = 7

        max        = 10
}
$in_ts { ts[$in_ts]=$0 }
$in_ms { ms[$in_ms]=$0 }

END{
        for(l in ts){
                $0 = ts[l]
                $both = ""
                $ts_only = ""
                $ms_only = ""
                if(l in ms){
                        $both = l

                        split(ms[l], tmp)
                        for(i=in_ms; i<=max; i++)
                                $i = tmp[i]
                }
                else {
                              $ts_only = l
                        for(i=ts_only+1; i<=max; i++)
                                $i = ""
                }

                result[l] = $0
                delete ms[l]
        }

        # Now ms contain "ms_only" records
        for(l in ms) {
                $0 = ms[l]
                $ms_only = l
                for(i=1; i<ms_only; i++)
                        $i = ""
                result[l] = $0
        }

        for(l in result)
                print result[l]
}

Interesting parts here (relevant quotes from `man awk'):
1)


2)


You can use `asorti()' to sort `result' by IP address if you need to.

Let me know if you need further explanation.
Hope this helps.

One thing -- how do I execute this? is it ./script < file.csv ?

wolverene13 12-20-2011 01:17 PM

Quote:

Originally Posted by grail (Post 4554260)
Please include the header information (ie what does each field represent) for the csv as this will help work out how to arrive at your required output.

grail - I have the headers in the "excel-like" examples, but forgot to put them in the csv-style example. I added them just now. Thanks.

firstfire 12-20-2011 01:24 PM

Quote:

Originally Posted by wolverene13 (Post 4554644)
One thing -- how do I execute this? is it ./script < file.csv ?

Good question! :)
Code:

awk -f script.awk < file.csv > out.csv
"<" is optional.

P.S. to skip the header add the line (in bold)
Code:

...
        max    = 10
}
FNR == 1 {print; next}
$in_ts { ts[$in_ts]=$0 }
...


wolverene13 12-20-2011 02:15 PM

Quote:

Originally Posted by firstfire (Post 4554651)
Good question! :)
Code:

awk -f script.awk < file.csv > out.csv
"<" is optional.

P.S. to skip the header add the line (in bold)
Code:

...
        max    = 10
}
FNR == 1 {print; next}
$in_ts { ts[$in_ts]=$0 }
...


I'm getting a syntax error near line 13 for some reason:

Code:

awk -f awk_test_script.sh < mes-SHEET.csv > mes-out.csv
awk: syntax error near line 13
awk: bailing out near line 13


firstfire 12-20-2011 11:04 PM

Quote:

Originally Posted by wolverene13 (Post 4554669)
I'm getting a syntax error near line 13 for some reason:

Code:

awk -f awk_test_script.sh < mes-SHEET.csv > mes-out.csv
awk: syntax error near line 13
awk: bailing out near line 13


It looks like your awk is old (according to this link). Try using nawk or gawk instead. On my machine script works fine
Code:

$ awk -f rearrange.awk test.csv
In backup,NodeID-TS,Name-TS,In TS,In TS but not MS,In Both,In MS but not TS,In MS,Name-TS,NodeID-MS
"Yes",11111,"node1","10.10.10.1",,"10.10.10.1",,"10.10.10.1","node_1",11111
"No",22222,"node2","10.10.10.2",,"10.10.10.2",,"10.10.10.2","node_2",22222
"Yes",33333,"node3","10.10.10.3","10.10.10.3",,,,,
,,,,,,"10.10.10.4","10.10.10.4","node_4",44444
"Yes",55555,"node5","10.10.10.5",,"10.10.10.5",,"10.10.10.5","node_5",55555
,,,,,,"10.10.10.6","10.10.10.6","node_6",66666
"No",77777,"node7","10.10.10.7","10.10.10.7",,,,,
"Yes",88888,"node8","10.10.10.8","10.10.10.8",,,,,
,,,,,,"10.10.10.9","10.10.10.9","node_9",99999

My awk vesion is GNU Awk 3.1.8.

grail 12-21-2011 12:44 AM

Here is another direction for the same solution:
Code:

#!/usr/bin/awk -f

BEGIN{
        FS = ","
        front = "%-10s%-10s%-10s%-15s%-17s"
        back = "%-17s%-15s%-10s%-10s\n"
        format = front"%-15s"back
}

{  gsub(/"/,"")    }

NR == 1 || $2 == $10{ printf format,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10;next }

{
    printf front,$1,$2,$3,$4,$5

    if( end[x] && ( ! $2 || $2 == x ) ){
        mid = " "
        if( $2 == x )
                mid = $4

        printf "%-15s%s",mid,end[x]
        delete end
    }
    else
        printf "\n"

    if( ! end[x] ){
        end[$10] = sprintf(back,$7,$8,$9,$10)
        x = $10
    }
}

END{
    if( end[x] )
        printf "%-77s%s"," ",end[x]
}

Simply run as:
Code:

./script.awk file.csv

wolverene13 12-21-2011 04:55 AM

Quote:

Originally Posted by firstfire (Post 4554956)
It looks like your awk is old (according to this link). Try using nawk or gawk instead. On my machine script works fine
Code:

$ awk -f rearrange.awk test.csv
In backup,NodeID-TS,Name-TS,In TS,In TS but not MS,In Both,In MS but not TS,In MS,Name-TS,NodeID-MS
"Yes",11111,"node1","10.10.10.1",,"10.10.10.1",,"10.10.10.1","node_1",11111
"No",22222,"node2","10.10.10.2",,"10.10.10.2",,"10.10.10.2","node_2",22222
"Yes",33333,"node3","10.10.10.3","10.10.10.3",,,,,
,,,,,,"10.10.10.4","10.10.10.4","node_4",44444
"Yes",55555,"node5","10.10.10.5",,"10.10.10.5",,"10.10.10.5","node_5",55555
,,,,,,"10.10.10.6","10.10.10.6","node_6",66666
"No",77777,"node7","10.10.10.7","10.10.10.7",,,,,
"Yes",88888,"node8","10.10.10.8","10.10.10.8",,,,,
,,,,,,"10.10.10.9","10.10.10.9","node_9",99999

My awk vesion is GNU Awk 3.1.8.

Yes, that's the problem -- at work we are using Solaris, and it doesn't work there, but it looks like it works on my Debian machine at home without a problem. I'll run it for real and let you know if it worked.


All times are GMT -5. The time now is 04:25 PM.