Programming: This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
Please use [CODE][/CODE] tags around your code and example files to make them more readable.
You can make your awk snippet a bit more readable:
Code:
awk -F ',' -v pattern='field2' '#
(NR == 1) {
    for (i = 1; i <= NF; i++)
        if ($i ~ pattern)
            field[++fields] = i
    next
}
{
    if (fields > 0) {
        printf("%s", $(field[1]))
        for (i = 2; i <= fields; i++)
            printf(" %s", $(field[i]))
        printf("\n")
    }
}' input-file(s).. > output-file
As you can see, you need to save the regular expression in a variable first (use the -v var=value option); then you can use the regular expression match operator ~ with it.
Trying to insert the exact string into the awk program from the command line is quite hazardous: consider what would happen if the pattern specified was ./) system("rm -rf") # or something along the same lines.
Using a variable like in the example above avoids the risks, because then awk treats the string as a regular expression, whatever it might contain. You cannot "escape" from the regular expression by crafting a nasty pattern.
Other than that, I use the next statement to skip the header line once the field names have been read, and use one-based indexing for the array, since awk arrays are normally indexed that way. Less risk of confusion.
If you want to use the second positional shell parameter as the pattern, just use -v pattern="$2" instead.
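To make the safety point concrete, here is a small self-contained demonstration (the sample data and the pattern 'field[0-9]+' are made up for illustration): because the pattern arrives via -v, awk only ever sees it as a regular expression to match against, never as program text.

```shell
# Hypothetical demo: the pattern comes in through -v, so awk treats it
# purely as an ERE, never as part of the program.
pattern='field[0-9]+'
printf 'field1,field2,other\n1,2,3\n' | awk -F ',' -v pattern="$pattern" '
(NR == 1) {
    for (i = 1; i <= NF; i++)
        if ($i ~ pattern)
            field[++fields] = i
    next
}
{
    for (i = 1; i <= fields; i++)
        printf "%s%s", $(field[i]), (i < fields ? " " : "\n")
}'
# prints: 1 2
```

Only the header columns field1 and field2 match the pattern, so only the first two data columns are printed.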
Last edited by Nominal Animal; 05-18-2012 at 09:57 PM.
Many thanks! Yes, I forgot to add -v pattern="$2" to my script.
But let me explain the whole idea: I'm trying to make my script work like MySQL.
I have a file like this:
user_table.txt
1st_name,last_name,birth,gender,field
john,ntcho,1990,m,MBA
louna,hail,1980,f,ITIL
XXX,XXX,XXX,X,XXX
so my script should handle a query like this:
select 1st_name from user_table where field eq mba
So I started with awk to select the fields under 1st_name. Did I start the right way? I'm a little confused about how to begin.
You're pretty much on track. You can always add the interface afterwards, as long as you have the functionality first.
For now, consider only operations of the form SELECT output FROM file(s) WHERE input op value.
You can do that operation using awk: awk -v output="output" -v input="input" -v op="op" -v value="value" '...' file(s)
To prove that you are on track, consider the following. I like to set the record separator to any newline convention, and both field and record separators to consume extra whitespace.
Code:
awk -v output='1st_name' -v input='field' -v op='eq' -v value='mba' '#
BEGIN {
    # Accept any newline convention; remove leading and trailing whitespace on lines.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"
    # Comma-separated values. Remove leading and trailing whitespace.
    FS = "[\t\v\f ]*,[\t\v\f ]*"
    # Output lines use the Linux newline convention (\n only).
    ORS = "\n"
    # Output fields are separated by a comma followed by a space.
    OFS = ", "
    # To compare case-insensitively, we convert to uppercase.
    outputup = toupper(output)
    inputup = toupper(input)
    opup = toupper(op)
    valueup = toupper(value)
    # Convert output to array 'outname', with 'outputs' entries.
    if (output == "" || output == "*") {
        # All!
        outputs = 0
        split("", outname)
        split("", outnameup)
    } else {
        outputs = split(output, outname)
        outputs = split(outputup, outnameup)
    }
}
# First line of each file contains the field names.
(FNR == 1) {
    # New file. Clear the field name arrays.
    split("", fieldup)
    split("", field)
    # Save the field names (and their uppercase forms) with their column numbers:
    for (i = 1; i <= NF; i++) {
        field[$i] = i
        fieldup[toupper($i)] = i
    }
    # Which field (number) is the input field?
    if (inputup in fieldup)
        inputnum = fieldup[inputup]
    else
        inputnum = (NF + 1)
    # Which fields to output?
    for (i = 1; i <= outputs; i++)
        if (outnameup[i] in fieldup)
            outputnum[i] = fieldup[outnameup[i]]
        else
            outputnum[i] = (NF + 1)
    # Note: If a field is not found, we use the
    # field after the last one, $(NF + 1).
    # It should be empty.
    # This line has no data, so skip the other rule.
    next
}
( (opup == "EQ" && toupper($(inputnum)) == valueup) ||
  (opup == "LT" && toupper($(inputnum)) <  valueup) ||
  (opup == "LE" && toupper($(inputnum)) <= valueup) ||
  (opup == "GT" && toupper($(inputnum)) >  valueup) ||
  (opup == "GE" && toupper($(inputnum)) >= valueup) ||
  (opup == "NE" && toupper($(inputnum)) != valueup) ||
  (opup == "CONTAINS"   && index(toupper($(inputnum)), valueup) > 0) ||
  (opup == "NOTCONTAIN" && index(toupper($(inputnum)), valueup) < 1) ||
  (opup == "MATCH"   && $(inputnum) ~  value) ||
  (opup == "NOMATCH" && $(inputnum) !~ value) ) {
    # Output this record.
    if (outputs > 0) {
        # Only the selected fields.
        printf("%s", $(outputnum[1]))
        for (i = 2; i <= outputs; i++)
            printf("%s%s", OFS, $(outputnum[i]))
        printf("%s", ORS)
    } else {
        # All fields.
        printf("%s", $1)
        for (i = 2; i <= NF; i++)
            printf("%s%s", OFS, $i)
        printf("%s", ORS)
    }
}
' user_table.txt
Given the user_table.txt content, it outputs john.
If you want to be able to parse a subset of SQL queries given as text, just write a frontend in e.g. Bash, and only run the awk script once you have parsed out the output, input, op, value, and file(s) fields. (It would be more efficient to use a separate awk script for each comparison operator, but considering this will be rather slow anyway, I think the one-script approach is good enough.)
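A minimal (hypothetical) sketch of such a frontend: it only splits a fixed-shape query on whitespace, with no quoting or error checking, just to show how the pieces would reach the awk script's -v variables.

```shell
#!/bin/bash
# Hypothetical frontend sketch: split a query of the exact shape
#   select OUTPUT from FILE where INPUT OP VALUE
# into the variables the awk script expects. No quoting, no validation.
query='select 1st_name from user_table.txt where field eq mba'
read -r _ output _ file _ input op value <<< "$query"
echo "output=$output file=$file input=$input op=$op value=$value"
# ...and then hand the pieces to awk:
# awk -v output="$output" -v input="$input" -v op="$op" -v value="$value" '...' "$file"
```

Anything beyond this fixed shape (quoted values, AND/OR clauses) needs a real parser.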
Note that the above implementation is a very crude example. It does not support quoted values, embedded commas or newlines: it is definitely not suited for all possible data. There are ways to fix those, but the best solution depends on whether the input file has to be CSV, or if you can use your own format (or at least escape sequences).
Oooh, that's great, it's exactly what I'm looking for, many thanks!
You discuss almost all the cases I could face. I will try to fit this script together with my own, and my next step is to remove the selected row from the table, like this:
remove from "name_of_table" where "field" eq "value"
so I have to use "sed" with "awk", right?
I'm sorry Grail, I don't understand how to use my data in post #3!
And another question, please: how do I group all the results of a "for loop" and get the output in one argument outside the loop, not just the last argument?
For example:
Code:
for (( i=0; i<=$(( $col_num - 1 )); i++ )); do
    echo -n "${col_num[$i]}"
done
echo -n "${col_num[@]}" # it will just print the last argument
remove from "name_of_table" where "field" eq "value"
To manipulate the file itself, you need to make some changes. I recommend writing a different awk snippet for each way of manipulating the database. (I'd say one awk snippet for reading/scanning the database, one for removing data from the database, and one for adding data to the database.)
For deletion, the approach is: output all records that are NOT to be deleted, exactly as they appear in the input file. (Just ignore all records that are to be deleted.)
In the shell script which runs your awk commands -- the "interface" part --, first create a safe, automatically removed temporary directory to put your temporary files into:
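The code block itself appears to be missing from the post; based on the description that follows (a trap-cleaned temporary directory, with noclobber disabled on the last line), it likely looked something like this (the WORK variable name is a guess):

```shell
# Create a private temporary directory, removed automatically when the script exits.
WORK="$(mktemp -d)" || exit 1
trap "rm -rf '$WORK'" EXIT
# Allow redirections to overwrite existing files (disable noclobber).
set +o noclobber
```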
The last line above disables the noclobber option. If it were enabled, Bash would abort if you tried to redirect output to an existing file. Redirecting to an existing file is useful, because that way you can replace only the file contents, while keeping the file metadata (like owner, group, access mode, extended attributes, security context if using SELinux, and so on) intact.
Instead of trying to edit the database file directly, create an awk snippet that outputs the input records fully and exactly (using printf("%s\n", $0)) if they are not to be removed. This also means you must invert the matching logic, since it needs to match only the lines that are kept. None of the output-selection functionality is needed, so this awk snippet should be quite a bit shorter, too.
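As a rough sketch of that inversion (handling only the "eq" operator, and assuming the same user_table.txt layout as above), the deletion snippet could look like:

```shell
# Hypothetical deletion sketch: keep the header line and every record whose
# 'field' column does NOT equal the value (compared case-insensitively).
awk -F ',' -v input='field' -v value='mba' '
BEGIN {
    inputup = toupper(input)
    valueup = toupper(value)
}
(FNR == 1) {
    # Locate the input column in the header, and keep the header itself.
    for (i = 1; i <= NF; i++)
        if (toupper($i) == inputup)
            inputnum = i
    printf("%s\n", $0)
    next
}
# Inverted match: output only the records that are NOT to be deleted.
(toupper($(inputnum)) != valueup) { printf("%s\n", $0) }
' user_table.txt
```

With the sample data, this drops john's row (field MBA) and passes the header and every other record through unchanged.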
Use the database file as the input, but redirect the output to a temporary file. Only if successful (no errors), replace the input file with the temporary file. In Bash:
Code:
if cp --preserve=all -f "$DATABASEFILE" "$WORK/tempbase" && \
   awk ... '#
       ...
   ' "$DATABASEFILE" > "$WORK/tempbase" && \
   mv -f "$WORK/tempbase" "$DATABASEFILE" ; then
    : # Successfully edited "$DATABASEFILE"
else
    : # Failed to edit "$DATABASEFILE"!
fi
The purpose of the cp --preserve=all -f command is NOT to copy the data (it does, but whatever data it copies will be overwritten by the awk command anyway), but to copy the metadata. This way the temporary file has the exact same access restrictions, security context, and so on as the original file. This is why you need to disable the noclobber option I mentioned earlier: the awk output will be redirected to an existing file.
The -f flag tells cp to overwrite the tempbase file, if it happens to already exist. This way you don't need to remove the temporary database files from the working directory. You can just rely on the trap to remove all of them at once, when the script finally exits.
The && tells Bash to run the right side only if the left side succeeded; you can read it as "and if that succeeded, then". Thus, the awk bit is only run if cp succeeds, and mv only if awk succeeds. (If awk is not run, it cannot succeed. So, mv is only run if cp first succeeds, and then awk also succeeds.)
The above scheme may look overly complicated, but it turns out to be extremely robust in practice. Even though it replaces the database files, it keeps the metadata, even extended attributes, intact. It only replaces the database file if the editing was successful; read or write errors (like running out of disk space) will not corrupt your database file.