LinuxQuestions.org - How to scan a file with 2 different field separators?

How to scan a file with 2 different field separators?

I have a Linux assignment - I'm not asking for any code, just an idea - and the task is to scan a text file and sort it first by one field separator, then sort every field by another separator. Awk looks like a good choice, but I don't know how to use 2 FS in one awk script or how to call an awk script from another awk script.

If the scanning process is sequential (that is the output of the first scanning represents in toto the input for the second scanning) you can pipe two awk calls, as in

Code:

awk -F, 'some_awk_code_here' filename | awk -F: 'some_other_awk_code_here'

In this example the option -F tells awk which separator to use ("," in the first call, ":" in the second one). For a more sophisticated interaction between I/O, please post an example of what has to be done. It will be my pleasure to help!
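To make the piped approach concrete, here is a minimal sketch. The sample line and the choice of which fields to print are assumptions made up for illustration; they are not taken from the assignment.

```shell
# Hypothetical sample line: comma-separated groups, colon-separated items inside.
line='foo,bar:foobar,barfoo:end'

# First pass: split on "," and keep the second group ("bar:foobar").
# Second pass: split that group on ":" and keep its second item.
result=$(printf '%s\n' "$line" | awk -F, '{ print $2 }' | awk -F: '{ print $2 }')
echo "$result"
```

The second awk only ever sees what the first one printed, so this fits best when the first pass narrows or rearranges the data for the second.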

For the hell of it, here's another suggestion. We'll go with the same assumption -- that one delimiter is a comma and the other is a colon.

First, run it through tr to change all delimiters to the same type:

Code:

tr ',' ':' < some-data-file

After that, all commas will be translated to colons. So you can use colucix's awk invocation above, except only for colons. (In other words, once everything has been changed to a single delimiter, life becomes easier.)

[ caveat: It's entirely possible I am misunderstanding what you are trying to do. I'm not too sure what you mean when you say "sort by separator". ]
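As a rough sketch of the tr approach described above (the sample line is made up), note that after the translation awk can no longer tell which fields were originally separated by a comma and which by a colon:

```shell
# Hypothetical sample line with both delimiters.
line='foo,bar:foobar'

# Turn every comma into a colon, then split on ":" only.
# NF is the number of fields awk found; $3 is the third one.
result=$(printf '%s\n' "$line" | tr ',' ':' | awk -F: '{ print NF, $3 }')
echo "$result"
```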

Here's an extract from the info file for sort which may suggest a way to accomplish your task.

Code:

  * Sort a set of log files, primarily by IPv4 address and secondarily
    by time stamp.  If two lines' primary and secondary keys are
    identical, output the lines in the same order that they were
    input.  The log files contain lines that look like this:

          4.150.156.3 - - [01/Apr/2004:06:31:51 +0000] message 1
          211.24.3.231 - - [24/Apr/2004:20:17:39 +0000] message 2

    Fields are separated by exactly one space.  Sort IPv4 addresses
    lexicographically, e.g., 212.61.52.2 sorts before 212.129.233.201
    because 61 is less than 129.

          sort -s -t ' ' -k 4.9n -k 4.5M -k 4.2n -k 4.14,4.21 file*.log |
          sort -s -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n

    This example cannot be done with a single `sort' invocation, since
    IPv4 address components are separated by `.' while dates come just
    after a space.  So it is broken down into two invocations of
    `sort': the first sorts by time stamp and the second by IPv4
    address.  The time stamp is sorted by year, then month, then day,
    and finally by hour-minute-second field, using `-k' to isolate each
    field.  Except for hour-minute-second there's no need to specify
    the end of each key field, since the `n' and `M' modifiers sort
    based on leading prefixes that cannot cross field boundaries.  The
    IPv4 addresses are sorted lexicographically.  The second sort uses
    `-s' so that ties in the primary key are broken by the secondary
    key; the first sort uses `-s' so that the combination of the two
    sorts is stable.

Quote:

Originally Posted by colucix

If the scanning process is sequential (that is the output of the first scanning represents in toto the input for the second scanning) you can pipe two awk calls, as in

Code:

awk -F, 'some_awk_code_here' filename | awk -F: 'some_other_awk_code_here'

Excellent idea, colucix. Is there a way to implement this in one awk script?
anomie, I cannot transform every separator because I need different functions to be run on different fields.
PTrenholme, also a good idea, but I don't want to sort the fields.
Thank you all.

Quote:

Originally Posted by cdog

is there a way to implement this into one awk script?

Do you mean a single script that performs a single call to awk? The answer is always YES, since awk is a very powerful scripting language. In this case, however, I suggest specifying the first field separator in the BEGIN section of the script, e.g.

Code:

BEGIN { FS = "," }

Then we can process each field by splitting it into subfields by means of the split function, e.g.

Code:

split($2,names,":")

This is just an example which splits the 2nd field using ":" as the separator and assigns the resulting pieces to the array "names". Then you can do some other processing on each element of the array.
A correct answer to your question requires knowing the task you have to accomplish. By the way, this is a general issue for the great awk language!
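Putting the two pieces together, a minimal sketch might look like the following. The sample line and the "index=value" output format are assumptions chosen only to show what ends up in the array.

```shell
# Hypothetical sample line: "," is the outer separator, ":" the inner one.
line='foo,bar:foobar,barfoo:end'

result=$(printf '%s\n' "$line" | awk '
BEGIN { FS = "," }
{
    # split() returns the number of pieces it stored in the array
    n = split($2, names, ":")
    # A counted loop visits names[1] .. names[n] in their original order
    for (i = 1; i <= n; i++)
        printf "%d=%s%s", i, names[i], (i < n ? " " : "\n")
}')
echo "$result"
```

Here the outer fields stay available as $1, $2, ... while the pieces of one field live in the array, so the two levels of separators remain distinguishable.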

Hi,

(GNU) Awk accepts multiple separators, because a field separator longer than one character is treated as a regular expression:

awk -F",|:" '{ .........}' infile

Both the , and the : are used as separators.

Code:

$ cat infile
foo,bar:foobar,barfoo:end

$ awk -F":|," '{ print $2, $5 }' infile
bar end

Hope this helps.

Thanks guys, but I cannot use your ideas:
colucix: I need the fields inside the big field in order, and using arrays I cannot accomplish this.
druuna: I need to distinguish between the fields separated by ":" and the ones separated by ","

cdog,

Post some sample data and how you want the results to come out. That'll make this less ambiguous and get you help quicker.

Hi,

If you need to distinguish between the field separators, it seems that PTrenholme gave the answer (post #4). Sort can do this.

Man sort or info sort for details.

druuna, I don't want to sort the input. Is there a way to use sort with its options and not sort the input?
anomie, here is an example: january,february,june:sunday,saturday,monday. The output will be 1,2,6:1,7,2; something like that.

colucix, I take it back, your idea works. Running through the array using for (index in array) does not get the elements in order, but using for (i=1; i<=array_size; i++) does. Thanks.

Hi,

If your input has a fixed layout you could use something like this (shortened, but you probably get the idea):

Code:

#!/bin/bash

awk '
  BEGIN {
          FS = "[,:]"

          # Fill month array with months/number pairs
          month["january"] = "1"; month["february"] = "2"
          month["june"] = "6"

          # Fill week array with week/number pairs
          week["sunday"] = "1"  ; week["monday"] = "2"
          week["saturday"] = "7"
  }
  {
    print month[$1]","month[$2]","month[$3]":"week[$4]","week[$5]","week[$6]
  } ' infile

Hope this helps.

druuna, the input is not fixed, but I managed to solve it using something similar. Thanks.
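cdog's actual solution was not posted. For input without a fixed layout, one possible sketch combines the lookup tables with split and a counted loop. The helper name convert is hypothetical, and it is only an assumption that each line holds a month list, a colon, and a weekday list, as in the sample given earlier in the thread.

```shell
# Sample line from the thread: months before the colon, weekdays after it.
line='january,february,june:sunday,saturday,monday'

result=$(printf '%s\n' "$line" | awk '
BEGIN {
    FS = ":"
    month["january"] = 1; month["february"] = 2; month["june"] = 6
    week["sunday"] = 1; week["monday"] = 2; week["saturday"] = 7
}
# Hypothetical helper: translate a comma-separated list through a lookup
# table, in input order; names missing from the table are left unchanged.
function convert(list, table,    parts, n, i, out) {
    n = split(list, parts, ",")
    out = ""
    for (i = 1; i <= n; i++)
        out = out (i > 1 ? "," : "") ((parts[i] in table) ? table[parts[i]] : parts[i])
    return out
}
{ print convert($1, month) ":" convert($2, week) }')
echo "$result"
```

Because the outer split is on ":" and the inner one on ",", each list can hold any number of names, and each side gets its own table.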