LinuxQuestions.org

malony101 11-26-2012 07:28 AM

using 'awk' to parse From word To word
 
Hi,

Not sure this is a newbie question, but here we go... :)
This is a scripting question.
I'm trying to extract a specific part of a line.
For example, given "this is a very long sentence with 9 words", I would like to use 'awk' (or any other tool, for that matter) to print out the following: "this is a very long". The first and last words never change (they are actually symbols like '*' or '|' in my case),
but the number of words between them changes from one line to the next. I'm trying 'sed' now, but I'm not sure it will help.

Any ideas?

Velotrol 11-26-2012 07:37 AM

Do you need something like this?
Code:

echo "this is a very long sentence with 9 words" | cut -d" " -f1-n
Where n is the number of words to print. In a script you must assign a number to a variable n, and its value depends on your needs.
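
As a minimal sketch of that advice in a script: the variable names line, n, and first_words are made up for illustration, and it assumes the words are separated by exactly one space.

```shell
# Sample sentence from the original question
line="this is a very long sentence with 9 words"

# Hypothetical: number of leading words to keep
n=5

# cut treats every single space as a field boundary
first_words=$(printf '%s\n' "$line" | cut -d' ' -f1-"$n")

printf '%s\n' "$first_words"
```

Note that cut does not collapse runs of spaces, so lines with irregular spacing would need a different approach.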

shivaa 11-26-2012 09:07 AM

It can be done easily with awk, but there must be some matching pattern so that the range of fields to print can be defined in the command. Please share a sample file so we can understand your requirement, and also share what you've tried so far.

Turbocapitalist 11-26-2012 02:18 PM

Can you give a few lines of sample data showing what you really have to work with?

malony101 11-28-2012 02:53 AM

Well, I found a delimiter I can use, although it is "³" (CTRL+, guys... :). It is a very small 3, which (according to what I've found) is a Unicode character. Its details are:

char: Unicode character U+00B3
Name: Superscript Three
Chart: Latin-1 Supplement
Decimal: 179
Hexadecimal: 0xB3

and so I have 2 questions:

1. Is it possible to use it "as is", meaning actually use ' awk -F "³" ' to define block selections? It's working for me, though I'm not sure it will work in every environment, because the character encoding might differ from one terminal to another.
2. If not, is it possible to define a Unicode character as a delimiter, meaning something like ' awk -F "U+00b3" ' or similar?

The actual line that I'm trying to parse is:

Rule No 1 ³active ³Server1 ³any ³oneway

and I need just the " Rule No 1 " part. Unfortunately, whitespace is possible inside it..:-/

Thanks.

Turbocapitalist 11-28-2012 08:10 AM

awk should work as-is. Or you can write out the character like this:

Code:

awk -F "\xb3" '{print $1}'
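
For the "as-is" variant, here is a rough sketch that pastes the character in literally and also strips the blanks left before the delimiter. The variable names line and rule are only for illustration, and it assumes the script and the data use the same encoding.

```shell
# Sample line from the thread; the delimiter is pasted in literally
line='Rule No 1 ³active ³Server1 ³any ³oneway'

# Split on the literal character, then strip trailing blanks from field 1
rule=$(printf '%s\n' "$line" | awk -F '³' '{ sub(/[ \t]+$/, "", $1); print $1 }')

# Brackets make any leftover whitespace visible
printf '[%s]\n' "$rule"
```

Spaces inside the first field are kept, because only the delimiter splits the line.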

David the H. 11-28-2012 12:07 PM

The bash shell and most of the other GNU tools are fully UTF-8 compatible these days, as long as the environment is set up for it. You can just cut & paste the values in.

One thing that used to be difficult, though, is getting the shell to generate non-ASCII text. But as of bash v4.2+ this has been solved. echo -e, printf %b, and the ANSI-C style $'..' quoting pattern all expand "\uNNNN" Unicode code points to their proper values.

Code:

$ echo -e $'\u00B3'
³

#to use it in a command
awk -F $'\u00B3' ....

Although, as shown above, this isn't needed for awk or sed, since they have a similar ability built in.

In earlier bash shells you have to encode the characters as multi-byte UTF-8 hex values (not raw Unicode hex!), like so:

Code:

$ echo -e '\xC2\xB3'
³
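
If the script also has to run under shells other than bash, one possible alternative (an assumption on my part, not something tested in this thread) is to build the same two bytes with printf octal escapes, which POSIX specifies. The names sep and sep_bytes are made up for this sketch.

```shell
# 0xC2 is \302 and 0xB3 is \263 in octal: the UTF-8 encoding of U+00B3
sep=$(printf '\302\263')

# Count the bytes to confirm the delimiter is two bytes long
sep_bytes=$(printf '%s' "$sep" | wc -c | tr -d ' ')

printf 'delimiter length in bytes: %s\n' "$sep_bytes"
```

The variable can then be passed on to other tools, for example as awk -F "$sep".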

On another point, how is this string being stored and supplied? If it has already been stored in a shell variable, then it should be trivial to parse it out using built-in parameter substitution or some other kind of string manipulation.

Just enable the extquote shell option first to allow you to use the ansi-c quotes inside parameter substitutions.

Code:

$ string='Rule No 1 ³active ³Server1 ³any ³oneway'
$ echo "${string%% ³*}"
Rule No 1

$ shopt -s extquote
$ echo "${string%%$' \u00B3'*}"
Rule No 1
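
The same expansion can be applied line by line, roughly as sketched here. The second data line is made up for illustration (note its two spaces before the delimiter), and the names rules, line, and name are hypothetical.

```shell
rules=$(
  while IFS= read -r line; do
    # Remove the longest suffix that starts at the first " ³"
    name=${line%% ³*}
    # Strip any spaces still left at the end of the name
    name=${name%"${name##*[! ]}"}
    printf '%s\n' "$name"
  done <<'EOF'
Rule No 1 ³active ³Server1 ³any ³oneway
Rule No 22  ³inactive ³Server2 ³any ³oneway
EOF
)

printf '%s\n' "$rules"
```

This avoids starting an external process per line, which may matter for long rule lists.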


malony101 12-03-2012 02:33 AM

David, Thank you for the detailed answer!
It was very helpful! :)
script is working just fine with
Code:

awk -F $'\u00B3'
Thanks!! :)

David the H. 12-05-2012 06:20 PM

Glad it's working for you.

Although as I mentioned, with awk it's probably better to use its own built-in character interpretation instead of relying on the shell (see the post above mine by Turbocapitalist).


Please mark the thread as "solved".


Edit: after a couple of tests, awk apparently doesn't accept Unicode code points, but it can expand multi-byte strings, in the same manner as earlier versions of bash.

Code:

awk -F '\xC2\xB3'
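
A slightly fuller sketch of the multi-byte approach, written with octal escapes (\302\263), which POSIX awk defines; as far as I know the \x form is a gawk extension. The variable names are hypothetical, and the " *" in front of the delimiter is my own addition to swallow the spaces before it.

```shell
# Build the sample line from the thread with explicit UTF-8 bytes
line=$(printf 'Rule No 1 \302\263active \302\263Server1 \302\263any \302\263oneway')

# The field separator is a regex: optional spaces, then the two delimiter bytes
rule=$(printf '%s\n' "$line" | awk -F ' *\302\263' '{ print $1 }')

printf '[%s]\n' "$rule"
```

Because the separator absorbs the spaces, the other fields ($2, $3, ...) also come out without trailing blanks.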


All times are GMT -5.