using 'awk' to parse From word To word

malony101 · 11-26-2012, 07:28 AM

Hi,

Not sure it's a newbie question, but here we go...

This is a script related question.
I'm trying to get a specific sentence which resides in a line.
For example: "this is a very long sentence with 9 words", and I would like to use 'awk' (or any other tool, for that matter) to print out the following: "this is a very long". The first word and last word never change (they are actually symbols like '*' or '|' in my case),
but the word count between these 2 words changes from one line to another. I'm trying 'sed' now, but I'm not sure it will help.

Any ideas?

Velotrol · 11-26-2012, 07:37 AM

Do you need something like this?

Code:

n=5
echo "this is a very long sentence with 9 words" | cut -d" " -f1-"$n"

Where n is the number of words printed (5 here, which gives "this is a very long"). In a script you must assign a number to the variable n, and its value depends on your needs.

shivaa · 11-26-2012, 09:07 AM

It can be done easily with awk, but there must be some pattern to match on, so that the range of fields to print can be defined in the command. Please share a sample file so we can understand your requirement properly, and also share what you've tried so far.

Turbocapitalist · 11-26-2012, 02:18 PM

Can you give a few lines of sample data showing what you really have to work with?

malony101 · 11-28-2012, 02:53 AM

Well, I found a delimiter I can use! although it is: "³". (CTRL+, guys...

It is a very small 3, which is actually (according to what I've found) a Unicode character, and its details are:

char: Unicode character U+00B3
Name: Superscript Three
Chart: Latin-1 Supplement
Decimal: 179
Hexadecimal: 0xB3

and so I have 2 questions:

1. Is it possible to use it "as is"? Meaning, actually use ' awk -F "³" ' in order to define block selections? It's working for me, though I'm not sure it'll work in every environment, because the character encoding might change from one terminal to another.
2. If not, is it possible to define a Unicode character as a delimiter, meaning something like ' awk -F "U+00b3" ' or something?

The actual line that I'm trying to parse is:

Rule No 1 ³active ³Server1 ³any ³oneway

and I need just the " Rule No 1 " part. Unfortunately, white-spaces are possible..:-/

Thanks.

Turbocapitalist · 11-28-2012, 08:10 AM

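awk should work as-is. Or you can write out the character as a hex escape (note that \xb3 is the single byte the character has in Latin-1; in UTF-8 it is stored as the two bytes C2 B3):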

Code:

awk -F "\xb3" '{print $1}'

David the H. · 11-28-2012, 12:07 PM

The bash shell and most of the other GNU tools are fully UTF-8 compatible these days, as long as the environment is set up for it. You can just cut and paste the values in.

One thing that used to be difficult, though, is getting the shell to generate non-ASCII text. But as of bash v4.2 this has been solved: echo -e, printf %b, and the ANSI-C style $'..' quoting pattern all expand "\uNNNN" Unicode code points to their proper values.

Code:

$ echo -e $'\u00B3'
³

#to use it in a command
awk -F $'\u00B3' ....

Although, as shown above, this is unneeded in awk or sed, as they have a similar ability built in.

In earlier bash shells you have to encode the characters as multi-byte UTF-8 hex values (not raw Unicode hex!), like so:

Code:

$ echo -e '\xC2\xB3'
³
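
To see why the UTF-8 form needs two bytes rather than the single code point value B3, the bytes can be dumped directly. This is only a sketch for a POSIX-style shell: the variable names are made up for illustration, and the character is built with printf octal escapes (302 263, the same bytes as hex C2 B3) so that it does not depend on the bash version:

```shell
# Build U+00B3 from its two UTF-8 bytes (octal 302 263 = hex C2 B3).
sep=$(printf '\302\263')

# Dump the bytes in hex and strip od's padding and newline.
bytes=$(printf '%s' "$sep" | od -An -tx1 | tr -d ' \n')

echo "$bytes"   # prints: c2b3
```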

On another point, how is this string being stored and supplied? If it has already been stored in a shell variable, then it should be trivial to parse it out using built-in parameter substitution or some other kind of string manipulation.

Just make sure the extquote shell option is enabled (it is on by default in bash) so that you can use the ANSI-C quotes inside parameter substitutions.

Code:

$ string='Rule No 1 ³active ³Server1 ³any ³oneway'
$ echo "${string%% ³*}"
Rule No 1

$ shopt -s extquote
$ echo "${string%%$' \u00B3'*}"
Rule No 1
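
If the separator can be preceded by more than one space or tab, the same parameter-substitution idea can be stretched a little. A minimal sketch, assuming a POSIX-style shell; the names sep and rule are only illustrative, and the sample string here has extra spaces added on purpose:

```shell
# U+00B3 as its two UTF-8 bytes, built with printf so pasting is not an issue.
sep=$(printf '\302\263')

# Hypothetical input with several spaces before the first separator.
string="Rule No 1   ${sep}active ${sep}Server1 ${sep}any ${sep}oneway"

# Drop everything from the first separator to the end of the string.
rule=${string%%"$sep"*}

# ${rule##*[![:space:]]} is whatever trailing whitespace is left; remove it.
rule=${rule%"${rule##*[![:space:]]}"}

echo "[$rule]"   # prints: [Rule No 1]
```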

malony101 · 12-03-2012, 02:33 AM

David, Thank you for the detailed answer!
It was very helpful!

The script is working just fine with

Code:

awk -F $'\u00B3'

Thanks!!

David the H. · 12-05-2012, 06:20 PM

Glad it's working for you.

Although, as I mentioned, with awk it's probably better to use its own built-in character interpreting instead of relying on the shell (see the post above mine by Turbocapitalist).

Please mark the thread as "solved".

Edit: after a couple of tests, awk apparently doesn't accept Unicode code points, but it can expand multi-byte strings, in the same manner as earlier versions of bash.

Code:

awk -F '\xC2\xB3'
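
For completeness, here is a sketch of the whole extraction on the sample line from earlier in the thread. It takes a slightly different route than the hex escape above: the two UTF-8 bytes are produced by the shell's printf (octal 302 263) and handed to awk as-is, which sidesteps any differences in how awk builds treat \x escapes. The variable names and the trailing-whitespace trim are assumptions, not something taken from the original script:

```shell
# U+00B3 as its two UTF-8 bytes (octal 302 263 = hex C2 B3).
sep=$(printf '\302\263')

# The sample line from the thread, assembled with the separator variable.
line="Rule No 1 ${sep}active ${sep}Server1 ${sep}any ${sep}oneway"

# Split on the separator, trim trailing blanks from field 1, print it.
rule=$(printf '%s\n' "$line" |
       awk -F "$sep" '{ sub(/[ \t]+$/, "", $1); print $1 }')

echo "[$rule]"   # prints: [Rule No 1]
```

To run it over a file instead of a single line, the printf can be replaced by the file name as awk's last argument.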