awk split single column into multiple columns based on RS

wolverene13 · 10-31-2012, 12:52 PM

Hello,

I have a .csv file full of device configurations, listed in a single column. Each configuration starts with the word "version". I would like to split each configuration off and put them all in another file, one config in each column. I found a solution that puts each configuration in its own file here, using

Code:

awk '{print $0 "version"> "file" NR}' RS='version'  input-file

, but that does not get me to where I need to be, where each config is in a different column in the same file. Can anyone help?

Thanks in advance.

colucix · 10-31-2012, 01:28 PM

Please, could you post a sample of the input file? Have the different configurations the same number of items (lines)?

wolverene13 · 10-31-2012, 03:15 PM

Quote:

Originally Posted by colucix

Please, could you post a sample of the input file? Have the different configurations the same number of items (lines)?

I have attached a sample file - there are two configs in there - keep in mind the configs may be various lengths depending on what is needed for that particular device. I have the file in .csv format, but LQ doesn't allow uploading that file type. Just change the extension and it should be the same as mine.

colucix · 10-31-2012, 04:45 PM

Looking at your sample file, I'm not sure about the desired output. Here is an awk code that takes the various configurations and put them side by side (independently from their length) using a fixed column width (previously computed):

Code:

BEGIN {
  
  RS = "\"    version  \""
  FS = "\n"
  
}

NR > 1 {

  for ( i = 1; i <= NF; i++ )
    i == 1 ? _[i,NR-1] = RS $i : _[i,NR-1] = $i
    
  NF > max_nf ? max_nf = NF : max_nf
    
}

END {

  for ( i = 1; i <= max_nf; i++ ) {
    for ( j = 1; j <= NR-1; j++)
      printf "%-85s", _[i,j]
    printf "\n"
  }
  
}

Moreover I'm not sure about the first two (recurring) lines of your file. Should they appear in the output file?

Finally note that I had to do some conversions to the sample file in order to avoid unpredictable results: I converted it from UTF-16 to UTF-8 encoding and changed the Windows-style line terminators (CR+LF) to Unix-style (LF):

Code:

$ iconv -fUTF-16 -tUTF-8 sample-csv.txt > sample.csv
$ dos2unix sample.csv

Feel free to ask for clarifications about the suggested code.

schneidz · 11-01-2012, 07:37 AM

Quote:

Originally Posted by wolverene13

I have attached a sample file - there are two configs in there - keep in mind the configs may be various lengths depending on what is needed for that particular device. I have the file in .csv format, but LQ doesn't allow uploading that file type. Just change the extension and it should be the same as mine.

the attached doesnt seem to be in comma-separated-values format ?

i would split the file by each line that contains version then paste each file together. (you mite need to work on space padding each line to the length that you want)

wolverene13 · 11-01-2012, 11:30 AM

Quote:

Originally Posted by schneidz

the attached doesnt seem to be in comma-separated-values format ?

i would split the file by each line that contains version then paste each file together. (you mite need to work on space padding each line to the length that you want)

I played with this some more and talked to the guy who gave me the file in the first place, and he said that he got the information by reading each config file and appending it to a csv spreadsheet, which basically creates a long column of configs. Then I realized that it would be much easier to just put each file in it's own column to begin with. So I guess my question really is:

How do you append file contents to a new column in a an existing csv file (when the number of files is unknown and their lengths are different)?

Example:

|Column A | Column B | Column C
|contents of file 1 | contents of file 2 | contents of file 3

schneidz · 11-01-2012, 11:41 AM

^ what i mentioned above is that your input is just plain unformatted ascii text (not sure why you think it is in csv format: http://en.wikipedia.org/wiki/Comma-separated_values ).

edit: i found this via google for splitting the input:

Code:

awk '{print "version"$0 > "config-"++i".txt"}' RS="version"  largefile.csv

then use the paste command will probably yield you best results.

wolverene13 · 11-01-2012, 12:27 PM

Quote:

Originally Posted by schneidz

^ what i mentioned above is that your input is just plain unformatted ascii text (not sure why you think it is in csv format: http://en.wikipedia.org/wiki/Comma-separated_values ).

edit: i found this via google for splitting the input:

Code:

awk '{print "version"$0 > "config-"++i".txt"}' RS="version"  largefile.csv

then use the paste command will probably yield you best results.

I am familiar with CSV files. In my previous post, I indicated that linux Questions will not allow uploading files with .csv as the extension - therefore, I had to change it to a .txt extension, and changing that to .csv on your end should put it back to the way it is on my end. Either way, I'll check the command out and see if it works.

Snark1994 · 11-01-2012, 12:33 PM

I agree with schneidz re: not being a .csv file - are you sure you uploaded the right one?

Code:

[joshua:/tmp o]$ iconv -fUTF-16 -tUTF-8 sample-csv.txt > sample.csv
[joshua:/tmp o]$ head sample.csv
# configuration file
exit all
"    version  ""3.01A07"" sw  ""1.801"""
    configure
        system
"            name  ""mes-leap-400-m-ave"""
"            contact  ""123-456-7890"""
"            location  ""leap,400-m-ave, BLCY 633695"""
#           Inventory
            inventory  1
[joshua:/tmp o]$ hexdump -C sample.csv | head                      
00000000  23 20 63 6f 6e 66 69 67  75 72 61 74 69 6f 6e 20  |# configuration |
00000010  66 69 6c 65 0d 0a 65 78  69 74 20 61 6c 6c 0d 0a  |file..exit all..|
00000020  22 20 20 20 20 76 65 72  73 69 6f 6e 20 20 22 22  |"    version  ""|
00000030  33 2e 30 31 41 30 37 22  22 20 73 77 20 20 22 22  |3.01A07"" sw  ""|
00000040  31 2e 38 30 31 22 22 22  0d 0a 20 20 20 20 63 6f  |1.801"""..    co|
00000050  6e 66 69 67 75 72 65 0d  0a 20 20 20 20 20 20 20  |nfigure..       |
00000060  20 73 79 73 74 65 6d 0d  0a 22 20 20 20 20 20 20  | system.."      |
00000070  20 20 20 20 20 20 6e 61  6d 65 20 20 22 22 6d 65  |      name  ""me|
00000080  73 2d 6c 65 61 70 2d 34  30 30 2d 6d 2d 61 76 65  |s-leap-400-m-ave|
00000090  22 22 22 0d 0a 22 20 20  20 20 20 20 20 20 20 20  |""".."          |
[joshua:/tmp o]$

I see no commas...

schneidz · 11-01-2012, 12:35 PM

^ you still misunderstand although its not that important to your issue.

regardless of the filename, the content inside your file is not in csv format (i.e.: the open-office csv import funtion doesnt separate it out into columns correctly because the file isnt separated by comma according to the value of each feild).

wolverene13 · 11-01-2012, 03:17 PM

Quote:

Originally Posted by schneidz

^ you still misunderstand although its not that important to your issue.

regardless of the filename, the content inside your file is not in csv format (i.e.: the open-office csv import funtion doesnt separate it out into columns correctly because the file isnt separated by comma according to the value of each feild).

Ah, correct - I misunderstood what you were saying - Yes, each word or phrase in the file is not separated by a comma as it normally is in a csv file (i.e. hello,everyone,I,am,on,the,Internet), despite having a "csv" file extension. That was done on my coworker's end merely to allow for putting the configs into Excel later on, and to facilitate the use of columns. I didn't bother asking him why this is a requirement, merely because I'm not involved in the project beyond this point.

Reuti · 11-01-2012, 05:07 PM

csplit can achieve it in a similar way:

Code:

$ csplit sample.csv '%^" *version%' '/^" *version/' '{*}'

With the awk version the quotation marks are no longer balanced.