[SOLVED] Create columns for each section of a line following the comma
Programming — This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
Hello all,
I have a problem separating data in the lines below. Normally, in order to get a single column I would do something like:
Code:
awk '{print $1}'
The problem with that is that the lines in the file below are not separated into columns by whitespace; the fields are delimited by commas.
So I need to find a way to split each line at the commas so it'll be easier to work with the data.
If anyone has advice on how I can put every string of characters that comes before each comma into its own column, and remove the comma, it would be greatly appreciated.
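For what it's worth, both awk and column can be told that the comma is the delimiter. A minimal sketch (the file name data.csv and its contents are invented here for illustration):
Code:
```shell
# Invented sample file for illustration:
printf 'john,smith,77\njane,doe,4\n' > data.csv

# Tell awk that fields are comma-separated, then print the first field:
awk -F, '{print $1}' data.csv

# Or split every field into an aligned column, discarding the commas:
column -s, -t data.csv
```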
Post #2 in this thread proposed using column. That works nicely, but all the data is left-justified. To my eye, right-justified values make a more attractive table (assuming a table is what you really want). Column doesn't have a shift-right option. (Pity!) A Google search turned up an awk script which does the job. The code isn't pretty but it works.
I don't take credit for the awk; I found it with a Google search. Its strength is that it is generalized: the user need not know the number of columns in the input file or the width needed to print each column.
This delivers only column 1; I interpreted the OP's requirement as being for all columns.
If only column 1 is needed, I would use cut. Fewer keystrokes.
Code:
cut -d, -f1 $InFile >$OutFile
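For instance, with made-up sample data ($InFile and $OutFile stand in for real file names):
Code:
```shell
# Hypothetical file names and contents, for illustration only:
InFile=in.csv; OutFile=out.txt
printf 'john,smith,77\njane,doe,4\n' > "$InFile"

# -d, sets the comma as delimiter; -f1 selects the first field:
cut -d, -f1 "$InFile" > "$OutFile"
cat "$OutFile"
```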
Daniel B. Martin
There are, of course, easier / better options, but I was using the OP's original input, as they seemed to be saying that the awk they were using was no longer able to provide the output they wanted.
As a variation on the code in the post by Mr. Martin (#7, above), consider this gawk script. (Note that it uses gawk extensions.)
My point is to illustrate the additional possibilities provided by gawk (though, actually, the column-alignment part could be done in POSIX awk).
Anyhow:
Code:
#!/bin/gawk -f
# Convert a CSV file to a fixed field length file
#
# Usage: convert {file1 {file2 ... fileN} output_file}
#
# where input defaults to stdin and output to stdout
#
# Notes: This program stores all the input data in an array,
# available for processing, if needed, before the
# output is written.
#
# The number of fields is NOT assumed to be the same
# in each record. Since the data array is two-dimensional,
# the number of fields in the i-th input line is length(data[i]),
# and those field values are data[i][j], j=1...length(data[i]).
########################################################################
#
# Initialization
BEGIN {
# Set the field separator to a comma
FS=","
# If more than one file is specified on the input
# line, and OutFile is not defined, assume that
# the last entry is the output file name.
if ((! OutFile) && (ARGC > 2)) {
OutFile=ARGV[ARGC-1]
--ARGC
}
else {
OutFile="/dev/stdout"
}
}
{
++record_count
if (field_count < NF) {
field_count = NF
}
for (i=1;i<=NF;i++) {
if (w[i]<length($i)) {
w[i]=length($i)
}
data[record_count][i]=$i
}
}
# Add an extra blank before each field except the first
END {
for (i=2; i<=field_count; ++i) {
++w[i]
}
}
# Write the output
END {
for (j=1; j<=record_count; ++j) {
for (i=1;i<=length(data[j]);++i) {
printf("%" w[i] "s", data[j][i]) > OutFile
}
printf "\n" > OutFile
}
}
# And, to illustrate other possibilities, some statistics ...
END {
for (j=1; j<=record_count; ++j) {
for (i=1; i<=length(data[j]);++i) {
datum=0+data[j][i]
if (datum != data[j][i]) {
continue # Skip any non-numeric values
}
++n[i]
if (datum > max[i]) {
max[i]=datum
}
if (min[i] > datum) {
min[i]=datum
}
sum[i]+=datum
ss[i]+=datum*datum
++freq[i][datum]
}
}
for (i=1; i<=field_count; ++i) {
if (n[i]>0) {
mean=sum[i]/n[i]
nv=asorti(freq[i],asc)
if (nv==1) {
printf("Column %d: All values are equal to %d.\n", i, asc[1])
continue
}
median=asc[nv]
std=((ss[i]-(mean*mean))/(n[i]-1))**0.5
printf("Column %d: Average=%f, Median=%f, Standard Error=%f\n", i, mean, median, std)
}
else {
printf("Column %d had no numeric entries.\n", i)
}
}
}
which, after being saved as "columize" and made executable, produces:
Code:
$ ./columize test_data test_out
Column 1: Average=5.333333, Median=1.000000, Standard Error=7.711160
Column 2: Average=1393517309.000000, Median=1393516861.000000, Standard Error=1393517309.000027
Column 3: Average=4.333333, Median=1.000000, Standard Error=8.062620
Column 4: Average=6005411421262476288.000000, Median=6005411420217366528.000000, Standard Error=6005411421262476288.000000
Column 5: Average=2.051282, Median=0.000000, Standard Error=11.408437
Column 6: Average=2663.974359, Median=0.000000, Standard Error=6005.381993
Column 7: All values are equal to 3.
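As noted above, the right-justification itself needs nothing beyond POSIX awk. A minimal two-pass sketch (the file name test_data and its contents are invented here; the script reads the input twice, measuring column widths on the first pass):
Code:
```shell
# Invented sample data; any comma-separated file works:
printf '1,22,333\n4444,5,66\n' > test_data

# Pass 1 (NR==FNR): record the widest entry seen in each column.
# Pass 2: print each field right-justified, padded one wider than its column.
awk -F, '
NR == FNR { for (i = 1; i <= NF; i++) if (length($i) > w[i]) w[i] = length($i); next }
{ for (i = 1; i <= NF; i++) printf("%" (w[i] + 1) "s", $i); printf "\n" }
' test_data test_data
```
(The format string is built by concatenation, "%" (w[i]+1) "s", because dynamic widths via %*s are a gawk extension and not portable awk.)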