LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (http://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Need a second pair of eyes to check this code for CSV file (http://www.linuxquestions.org/questions/linux-newbie-8/need-a-second-pair-of-eyes-to-check-this-code-for-csv-file-762344/)

shayno90 10-16-2009 05:56 AM

Need a second pair of eyes to check this code for CSV file
 
Hi, just got code to work that pulls out the fields I need from a file, 6 fields to be specific. Next I have to convert 3 of these fields.


'Mop-21050905','auth','info','info','26','2009-10-01 05:00:01','snort','snort[9861]: [1:254:7] DNS SPOOF query response with TTL of 1 min and no authority [Classification: Potentially Bad Traffic] [Priority: 2]: {UDP} 20867222222:53 -> 9522496106:58846',7119397

'Mop-21050905','auth','info','info','26','2009-10-01 07:59:09','snort','snort[9861]: [1:7861:1] POLICY Google Desktop activity [Classification: Potential Corporate Privacy Violation] [Priority: 1]: {TCP} 9522496106:43418 -> 7412553136:80',7123988

'Mop-21050905','auth','info','info','26','2009-10-01 08:02:22','snort','snort[9861]: [1:13864:1] POLICY Microsoft Watson error reporting attempt [Classification: Potential Corporate Privacy Violation] [Priority: 1]: {TCP} 9522496106:34663 -> 655553190:80',7124307

Using this code

awk -F"," '{ split($8,tmp,"{"); print $6 "," tmp[2] }' filename.txt | sed -e "s/} /,/g" -e "s/ -> /,/g" | tr ":" "," | tr -d "'"

gives

2009-10-09 12,45,18,ICMP,66102999,9522496106
2009-10-09 12,45,20,UDP,202103951,1096,9522496106,1434
2009-10-09 12,45,20,UDP,202103951,1096,9522496106,1434

which is what I want, however I cannot convert the datetime field (1st field) since the colon is removed from the time. It was removed in order to separate the IP address from its port number. (used to be 202103951:1096 now is 202103951,1096)

I need to use a sed command to put the time back to 12:45:18 and not 12,45,18
Tried using
sed -e 's/.^/\d+[0-9]/\d+[0-9]/\d+[0-9]/:/' but no luck:(
Any ideas using sed which will fit with my original command?

rizhun 10-16-2009 07:29 AM

Quick & Dirty:

Code:

$ sed -e 's/,/:/' -e 's/,/:/' your_data_file.txt
:)

shayno90 10-16-2009 09:31 AM

Thanks rizhun, it worked when I piped it onto the original command

awk -F"," '{ split($8,tmp,"{"); print $6 "," tmp[2] }' filename | sed -e "s/} /,/g" -e "s/ -> /,/g" | tr ":" "," | tr -d "'" | sed -e 's/,/:/' -e 's/,/:/' > filename.txt

Gives

2009-10-01 08:06:08,TCP,62.149.128.161,110,95.224.96.106,1047
2009-10-01 08:06:08,TCP,62.149.128.161,110,95.224.96.106,1047
2009-10-01 08:06:17,TCP,62.149.128.167,110,95.224.96.106,1053
2009-10-01 08:06:17,TCP,62.149.128.167,110,95.224.96.106,1053

Now I have another question, using this new file how can I get the 1st field of datetime to convert in each line using date -d "$1" +%s, maybe xarg or awk?
e.g.
xargs -a filename.txt -I date -d "$1" +%s

rizhun 10-16-2009 09:52 AM

You didn't specify whether the epoch date was an addition or replacement of the existing date, so I'm going to presume it's an addition:

Code:

cat datafile.txt | while read dataline
do
  date=$(echo ${dateline} | awk -F"," '{ print $1 }')
  epochdate=$(date -d "${date}" +%s)
  echo "${epochdate},${dataline}" >> newdatafile.txt
done


shayno90 10-20-2009 06:30 AM

hi rizhun, tried your code but I want to replace the original datetime with the new datetime after the datetime code is executed. Also the datetime code doesn't change the datetime for each field but only outputs the same integer as below

1255993200,2009-10-01 05:00:01,UDP,208.67.222.222,53,95.224.96.106,58846
1255993200,2009-10-01 07:59:09,TCP,95.224.96.106,43418,74.125.53.136,80
1255993200,2009-10-01 13:21:01,ICMP,83.237.90.142,95.224.96.106

So I want the datetime to look like this date (using this -d "2009-10-09 11:58:59" +%s should be date -d "$1" +%s)

1255993289,UDP,208.67.222.222,53,95.224.96.106,58846
1255389290,TCP,95.224.96.106,43418,74.125.53.136,80
1283202828,ICMP,83.237.90.142,95.224.96.106

So if you or anyone else can suggest how to change it, would be great.

rizhun 10-21-2009 03:57 AM

Hi shayno90,

There was a type-o in my code, try this:

Code:

# original code with fixed type-o
cat datafile.txt | while read dataline
do
  date=$(echo ${dataline} | awk -F"," '{ print $1 }')
  epochdate=$(date -d "${date}" +%s)
  echo "${epochdate},${dataline}" >> newdatafile.txt
done

# awk to remove unwanted second column
awk 'BEGIN{FS=",";OFS=","}{$2="";gsub(FS "+",FS)}1' newdatafile.txt > finished-data-file.txt


shayno90 10-21-2009 04:57 AM

Thanks rizhun the code outputted the data correctly :)
Now I am trying to convert the 2 ip address fields into integers for example

echo 194.169.240.130 | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print' (this gives the integer below)
3265917058

So I need to convert and substitute fields 2 and 4 from ip addresses to integers

1255993200,ICMP,86.47.191.211,95.224.96.106 ----------->1255993200,ICMP,3265917058,2261917067
1255993200,TCP,95.224.96.106,36491,65.55.53.190,80 ----------->1255993200,TCP,13259170432,36491,3234817345,80
1255993200,UDP,208.67.222.222,53,95.224.96.106,59329 ----------->1255993200,UDP,5875912358,53,3290917028,59329

Could changing your code like this work

cat finishedprotocol.txt | while read dataline
do
ipaddress=$(echo ${dataline} | awk -F"," '{ print $3, print$5 }')
newipaddress=$(ipaddress perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')
echo "${newipaddress},${dataline}" >> newerprotocol.txt
done

# awk to remove unwanted third and fifth columns
awk 'BEGIN{FS=",";OFS=","}{$3="";gsub(FS "+",FS)}3{$5="";gsub(FS "+",FS)}5' newerprotocol.txt > newestprotocol.txt

A big problem is that protocols TCP AND UDP have 6 fields so code would only work for them while protocol ICMP has only 4 fields (due to no ports for it)
i.e only the first ip address (field 3) will be converted and not the 2nd ip address (field 4) :(

Any workaround for this?

rizhun 10-21-2009 05:49 AM

Hmmm...

Try the following (not tested):

Code:

cat finishedprotocol.txt | while read dataline
do

  # find out which protocol we're dealing with
  datatype=$(echo ${dataline} | awk -F"," '{ print $2 }')

  # only 1 ip for icmp
  [[ ${datatype} == "ICMP" ]] && {
    # grab the ip
    ipaddress=$(echo ${dataline} | awk -F"," '{ print $3 }')
    # convert the ip
    newipaddress=$(echo ${ipaddress} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')
    # remove old ip column
    newdataline=$(echo ${dataline} | awk 'BEGIN{FS=",";OFS=","}{$3="";gsub(FS "+",FS)}1')
    # print new 'dataline' to new text file
    echo "${newipaddress},${newdataline}" >> newerprotocol.txt
    # start from the top of the loop
    continue
  }

  # anything except icmp; grab ip's
  ipaddone=$(echo ${dataline} | awk -F"," '{ print $3 }')
  ipaddtwo=$(echo ${dataline} | awk -F"," '{ print $5 }')

  # convert
  newipone=$(echo ${ipaddone} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')
  newiptwo=$(echo ${ipaddtwo} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')

  # remove old columns
  newdataline=$(echo ${dataline} | awk 'BEGIN{FS=",";OFS=","}{$3=$5="";gsub(FS "+",FS)}1')

  # print new 'dataline' to new text file
  echo "${newipone},${newiptwo},${newdataline}" >> newerprotocol.txt

done


shayno90 10-21-2009 06:15 AM

Thanks again rizhun just made a small change to the code to convert the 2nd ip address (4th field) of icmp also like so

cat finishedprotocol.txt | while read dataline
do

# find out which protocol we're dealing with
datatype=$(echo ${dataline} | awk -F"," '{ print $2 }')

# convert 2 ip addresses for icmp
[[ ${datatype} == "ICMP" ]] && {
# grab the ip
ipaddress1=$(echo ${dataline} | awk -F"," '{ print $3 }')
ipaddress2=$(echo ${dataline} | awk -F"," '{ print $4 }')
# convert the ip
newipaddress1=$(echo ${ipaddress1} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')
newipaddress2=$(echo ${ipaddress2} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')
# remove old ip column
newdataline=$(echo ${dataline} | awk 'BEGIN{FS=",";OFS=","}{$3=$4"";gsub(FS "+",FS)}1')
# print new 'dataline' to new text file
echo "${newipaddress1},${newipaddress2},${newdataline}" >> newerprotocol.txt
# start from the top of the loop
continue
}

# anything except icmp; grab ip's
ipaddone=$(echo ${dataline} | awk -F"," '{ print $3 }')
ipaddtwo=$(echo ${dataline} | awk -F"," '{ print $5 }')

# convert
newipone=$(echo ${ipaddone} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')
newiptwo=$(echo ${ipaddtwo} | perl -ne 's/(\d+)\.(\d+)\.(\d+)\.(\d+)/$1<<24|$2<<16|$3<<8|$4/e;print')

# remove old columns
newdataline=$(echo ${dataline} | awk 'BEGIN{FS=",";OFS=","}{$3=$5="";gsub(FS "+",FS)}1')

# print new 'dataline' to new text file
echo "${newipone},${newiptwo},${newdataline}" >> newestprotocol.txt

done

output is like this

3670927203,1608540266,1255993200,UDP,4244,1434
3494108894,1608540266,1255993200,UDP,53,58846
1608540266,1249719688,1255993200,TCP,43418,80
1608540266,1094137278,1255993200,TCP,34663,80
1049985185,1608540266,1255993200,TCP,110,1047

however only the UDP and TCP datalines have printed not the ICMP datalines and the positon of the protocol and port fields have moved to the end of the line, maybe I haven't included the ICMP to print in the "newestprotocol" file?)
any ideas why this occurred?

rizhun 10-21-2009 06:24 AM

You've got 2 different output files in your code:

newerprotocol.txt
and
newestprotocol.txt

Change one so they are both the same.

Any column re-arranging can be done by using 'awk' to print each field into a variable, then 'echo' those variables out in the order you require.

shayno90 10-21-2009 07:12 AM

Ya sorted the output by changing the file name, now trying to rearrange the field's positions but code is just hanging

cat newerprotocol.txt | while read dataline
do

field1=$(echo ${dataline} | awk -F"," '{ print $1 }')
field2=$(echo ${dataline} | awk -F"," '{ print $4 }')
field3=$(echo ${dataline} | awk -F"," '{ print $2 }')
field4=$(echo ${dataline} | awk -F"," '{ print $5 }')
field5=$(echo ${dataline} | awk -F"," '{ print $3 }')
field6=$(echo ${dataline} | awk -F"," '{ print $6 }')

echo "${field1},${field2},${field3},${field4},${field5},${field6}" >> newestprotocol.csv

done

The 4th and 6th field still appear in the ICMP lines so just wondering what else I'm missing in this code? Also the code appears to hang when it reaches "done" and won't execute until you hit the enter key, any thoughts on this?

shayno90 10-23-2009 09:53 AM

Another issue:

The 4th and 6th field still appear in the ICMP lines so just wondering what else I'm missing in this code?

Also the previous code in the previous reply appears to hang when it reaches "done" and won't execute until you hit the enter key, any thoughts on this?

lutusp 10-23-2009 03:11 PM

Quote:

Originally Posted by shayno90 (Post 3721516)
Hi, just got code to work that pulls out the fields I need from a file, 6 fields to be specific. Next I have to convert 3 of these fields.


'Mop-21050905','auth','info','info','26','2009-10-01 05:00:01','snort','snort[9861]: [1:254:7] DNS SPOOF query response with TTL of 1 min and no authority [Classification: Potentially Bad Traffic] [Priority: 2]: {UDP} 20867222222:53 -> 9522496106:58846',7119397

'Mop-21050905','auth','info','info','26','2009-10-01 07:59:09','snort','snort[9861]: [1:7861:1] POLICY Google Desktop activity [Classification: Potential Corporate Privacy Violation] [Priority: 1]: {TCP} 9522496106:43418 -> 7412553136:80',7123988

'Mop-21050905','auth','info','info','26','2009-10-01 08:02:22','snort','snort[9861]: [1:13864:1] POLICY Microsoft Watson error reporting attempt [Classification: Potential Corporate Privacy Violation] [Priority: 1]: {TCP} 9522496106:34663 -> 655553190:80',7124307

Using this code

awk -F"," '{ split($8,tmp,"{"); print $6 "," tmp[2] }' filename.txt | sed -e "s/} /,/g" -e "s/ -> /,/g" | tr ":" "," | tr -d "'"

gives

2009-10-09 12,45,18,ICMP,66102999,9522496106
2009-10-09 12,45,20,UDP,202103951,1096,9522496106,1434
2009-10-09 12,45,20,UDP,202103951,1096,9522496106,1434

which is what I want, however I cannot convert the datetime field (1st field) since the colon is removed from the time. It was removed in order to separate the IP address from its port number. (used to be 202103951:1096 now is 202103951,1096)

I need to use a sed command to put the time back to 12:45:18 and not 12,45,18
Tried using
sed -e 's/.^/\d+[0-9]/\d+[0-9]/\d+[0-9]/:/' but no luck:(
Any ideas using sed which will fit with my original command?


Code:

awk -F"," '{ split($8,tmp,"{"); print $6 "," tmp[2] }' filename.txt | sed -e "s/} /,/g" -e "s/ -> /,/g" | tr ":" "," | tr -d "'"
Please, will you people stop writing these one-line wonders! They are a waste of time, they identify rank amateur coders, they start out incomprehensible and unworkable and go downhill from there, and they beg to be thrown away.

Do it this way:

Code:

rn=1
cat data.txt | while read line
do
  echo "Record $rn:"
  fn=1
  echo "$line" | tr ',' '\n' | while read field
  do
      echo -e "\tField $fn: $field"
      ((fn++))
  done
  ((rn++))
done

Now it must be obvious that you can use the same coding pattern to break the fields down in the same way, down to individual characters if need be, to accomplish any earthly objective.

The advantages of this script is that you can understand it, change it, and use it again tomorrow for some other data processing task.

The advantage of the one-liner is that you can baffle onlookers with your imagined intellectual prowess, but as to accomplishing anything useful, forget it.

When you write a script, you have some options:

1. Use sed
2. Use awk
3. Use one-liners
4. Use find followed by either xargs or "-exec ... '{}' \;"
5. Use your head.

I recommend option 5.

While using your head, I invite you to find the flaw in my script, that will prevent it from working with just any arbitrary comma-separated database.

(pause ...)

Okay, time's up. If there are any embedded commas within the data fields, my script will fail. But this can be dealt with by using a smart parser that is only slightly more complex than the above script. Fair warning.

shayno90 10-27-2009 11:46 AM

Quote:

Originally Posted by lutusp (Post 3730222)
Code:

awk -F"," '{ split($8,tmp,"{"); print $6 "," tmp[2] }' filename.txt | sed -e "s/} /,/g" -e "s/ -> /,/g" | tr ":" "," | tr -d "'"
Please, will you people stop writing these one-line wonders! They are a waste of time, they identify rank amateur coders, they start out incomprehensible and unworkable and go downhill from there, and they beg to be thrown away.

Do it this way:

Code:

rn=1
cat data.txt | while read line
do
  echo "Record $rn:"
  fn=1
  echo "$line" | tr ',' '\n' | while read field
  do
      echo -e "\tField $fn: $field"
      ((fn++))
  done
  ((rn++))
done

Now it must be obvious that you can use the same coding pattern to break the fields down in the same way, down to individual characters if need be, to accomplish any earthly objective.

The advantages of this script is that you can understand it, change it, and use it again tomorrow for some other data processing task.

The advantage of the one-liner is that you can baffle onlookers with your imagined intellectual prowess, but as to accomplishing anything useful, forget it.

When you write a script, you have some options:

1. Use sed
2. Use awk
3. Use one-liners
4. Use find followed by either xargs or "-exec ... '{}' \;"
5. Use your head.

I recommend option 5.

While using your head, I invite you to find the flaw in my script, that will prevent it from working with just any arbitrary comma-separated database.

(pause ...)

Okay, time's up. If there are any embedded commas within the data fields, my script will fail. But this can be dealt with by using a smart parser that is only slightly more complex than the above script. Fair warning.



First of all I was given this code and it works!!!!!

Secondly, it is tough to understand but at least I understand it now more since it has been discussed and work on over a period of weeks (it is clear you are unaware of this!)

Thirdly, your code doesn't work for me so and isn't explained clearly

Fourthly you condescending attitude doesn't belong in a forum like this and you would be better off spending your time in complaining on some blog or twitter!!

ghostdog74 10-27-2009 12:59 PM

Quote:

Originally Posted by lutusp (Post 3730222)
[CODE]

Do it this way:

Code:

rn=1
cat data.txt | while read line
do
  echo "Record $rn:"
  fn=1
  echo "$line" | tr ',' '\n' | while read field
  do
      echo -e "\tField $fn: $field"
      ((fn++))
  done
  ((rn++))
done


no need to use cat. Also can be written another way which is slightly more efficient. Instead of reading the lines and doing tr for each line,
do the tr first
Code:

tr ',' '\n' < "data.txt"  | while ....
do

done

in fact, tr is also not needed since one can set IFS to get fields in just read into array all with bash.

that said, the whole thing above can also be written with just awk, whose file operations are more efficient than a bash while loop.


All times are GMT -5. The time now is 12:12 PM.