LinuxQuestions.org - [SOLVED] Extracting text at a dynamic location

ychaouche

09-29-2022 11:02 AM

Extracting text at a dynamic location

Consider a file that is structured like this

Code:



[...]

* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
on several lines
** [2] task baz
some description
on several lines
** [2.1.c] task bar
some description
on several lines

* other sections
 ...

I'd like to write a script that does the following:

Code:

1. Locate the last task listed in the * tail section.
   In this example, it is [2.1.c] task bar.

2. Display the description of that task as found in the * tasks section.
   In this example, it should display:

** [2.1.c] task bar
some description
on several lines

I'm curious whether there's a simple sed/awk solution to this. If not, I will turn to Python, but simpler is better.

boughtonp

09-29-2022 12:32 PM

This sounds familiar... https://www.linuxquestions.org/questions/programming-9/awk-sed-bash-script-to-display-last-changelog-entry-4175716978

That's a very similar problem, so yes, Awk can do this too, and you should be able to apply what you've learned from that thread.

Here are some hints: Use "\n\* " as a record separator, "\n\*\* " as a field separator, and "$NF" to refer to the last field in a record.

Using that information have a go yourself and if you get stuck show your efforts.
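
A minimal sketch of those hints, assuming gawk (or another awk that treats a multi-character RS as a regex). The sample file is made up, modelled on the structure in the first post, and the record number (NR==2) only holds for this sample:

```shell
# Hypothetical sample file, modelled on the structure in the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[...]

* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
on several lines
** [2.1.c] task bar
some description
on several lines
EOF

# Records are split on "\n* ", fields on "\n** "; $NF is the last field.
# NR==2 is the "tail" record here because one record of text precedes it.
last_field=$(awk -v RS='\n\\* ' -v FS='\n\\*\\* ' 'NR==2 { print $NF }' "$sample")
printf '%s\n' "$last_field"
rm -f "$sample"
```

The last field still holds two lines (the "sub level 2" heading and the task line), so one more step is needed to isolate the task itself.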

ychaouche

10-03-2022 05:30 AM

Thanks boughtonp! A little update: I need to access the last line of a field that spans multiple lines, like so:

Quote:

$ awk -v RS='\n\\* ' -v FS='\n\\*\\* ' 'NR==3 {print $(NF-3)}' ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
2022-09-19
09:44:58
[11.1] Plan

09:48:00
[15] override

09:50:45
[16] notes.search.recent

10:08:33
[16.1] refactoring de toutes les fonctions de recherche de notes

10:20:42
[16.1.1] learning to use getopts
$

In this example, I need to extract the last line, which is:

Quote:

[16.1.1] learning to use getopts

boughtonp

10-04-2022 12:09 PM

There are a few ways to access the last line of a variable; perhaps the simplest is to split on newlines and access the last element of the resulting array.

Awk's split works slightly differently from other languages: it fills the array passed as its second argument and returns the number of elements.

Code:

my_array_len = split(input_string,my_array,"\n");
print my_array[my_array_len];

Where "input_string" is changed to whatever variable/expression contains the lines, and "my_array" and "my_array_len" are variables which can be named however you like.

(If there's a trailing newline in the input, a -1 could be added to counter that.)
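
One way to handle the trailing newline without hard-coding a -1 is to step back past empty elements. This is only a sketch; the string s and the array name lines are made up for illustration:

```shell
last_line=$(awk 'BEGIN {
    # Hypothetical multi-line value ending in a newline, as a field often does.
    s = "10:08:33\n[16.1] refactoring\n\n10:20:42\n[16.1.1] learning to use getopts\n"
    n = split(s, lines, "\n")
    # A trailing newline leaves an empty last element; step back past it.
    while (n > 1 && lines[n] == "") n--
    print lines[n]
}')
printf '%s\n' "$last_line"
```

The loop also copes with several trailing blank lines, which a fixed -1 would not.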

MadeInGermany

10-04-2022 04:14 PM

Or pipe it to tail -1

Code:

awk ... | tail -1

ychaouche

10-05-2022 03:55 AM

Thank you boughtonp! That code will save me the gratuitous call to length() that my code makes:

Code:

18:25:20 ~ -1- $ awk  -v RS='\n\\* ' -v FS='\n\\*\\* ' 'NR==3 {fieldno=NF-3; split($fieldno,A,"\n"); print A[length(A)] }'  ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
[16.1.1] learning to use getopts
18:25:31 ~ -1- $

@MadeInGermany it's a shame to use another subprocess when you can do everything inside awk.

ychaouche

10-05-2022 06:42 AM

The code is now in its own file

Code:



#!/usr/bin/gawk -f

BEGIN {
    RS="\n\\* ";
    FS="\n\\*\\*";
}

NR==3 {
    fieldno=NF-3;
    l=split($fieldno,A,"\n");
    task=A[l];
    print("task is", task);
    printf("it was found on third record and %sth field\n", fieldno);
}

NR==4 {
    for(i = NF; i > 0; i--) {
        #if ($i ~ "learning to use getopts") {
        if ($i ~ task ) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

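The problem is that the task variable contains the special characters "[" and "]", so the if ($i ~ task) condition is never met: awk treats the string as a regex, in which [16.1.1] is a bracket expression matching a single character.
With the other test, if ($i ~ "learning to use getopts"), I get a match: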

Code:

12:19:53 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
expression found in 4th row, 43th field
 [16.1.1] learning to use getopts

12:33:22 ~/CODE/TMP -2- $

But with the if ($i ~ task) code I get no match

Code:

12:33:22 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
12:33:36 ~/CODE/TMP -2- $

boughtonp

10-05-2022 07:48 AM

To do a non-regex find, use index(haystack,needle). It returns the position of the match (a match at the very start of the string returns 1), or 0 if there is no match.

But it can be useful to convert a string to a regex pattern, by adding backslashes where required:

Code:

gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",text);

Then the pattern can have additional regex metacharacters prefixed/suffixed as required (e.g. to ensure start/end of string, variable prefixes, etc.)
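
To illustrate the difference between the two approaches, here is a small sketch; the variable names field and task and their values are made up to mirror the example in this thread:

```shell
result=$(awk 'BEGIN {
    field = " [16.1.1] learning to use getopts"
    task  = "[16.1.1] learning to use getopts"

    # As a dynamic regex, "[16.1.1]" is a bracket expression matching ONE
    # character out of 1, 6 or ".", so the literal brackets never match.
    verdict = (field ~ task) ? "match" : "no match"
    print "regex:", verdict

    # index() compares plain text and returns the 1-based position, or 0.
    print "index:", index(field, task)
}')
printf '%s\n' "$result"
```

The regex test reports no match, while index() reports position 2 because of the leading space in the field.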

ychaouche

10-05-2022 09:18 AM

gsub didn't work

Code:

NR==4 {
    for(i = NF; i > 0; i--) {
        #if ($i ~ "learning to use getopts") {
        gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",text);
        if ($i ~ task ) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

15:04:28 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
15:16:50 ~/CODE/TMP -2- $

but index did.

Code:



NR==4 {
    for(i = NF; i > 0; i--) {
        #if ($i ~ "learning to use getopts") {
        # gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",text);
        # if ($i ~ task) {
        if (index($i,task)) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

15:16:50 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
expression found in 4th row, 43th field
 [16.1.1] learning to use getopts

15:17:26 ~/CODE/TMP -2- $

boughtonp

10-05-2022 09:30 AM

The "text" bit was intended to be generic - in your context you'd want something like:

Code:

...
taskrx = task
gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",taskrx);
if ($i ~ taskrx) {
...

But this approach is more for when you want to add to the pattern - if index is sufficient, that's the simpler and more efficient approach.

ychaouche

10-05-2022 10:59 AM

oops, you were right ^^', didn't check the name of the variable.

But there's something intriguing: if I use the original variable task, I get a very big load of backslashes printed out (see https://i.imgur.com/mgoKUtJ.png).

Code:

NR==4 {
    for(i = NF; i > 0; i--) {
        gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",task);
        if ($i ~ task) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

If I use a copy of the variable, it works as expected:

Code:

NR==4 {
    for(i = NF; i > 0; i--) {
        taskc=task;
        gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",taskc);
        if ($i ~ taskc) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

ychaouche

10-05-2022 11:05 AM

Code:

gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",task);

This is handy. Should I turn it into a function? I might use it in future scripts.

boughtonp

10-05-2022 11:17 AM

Quote:

Originally Posted by ychaouche (Post 6384468)

But there's something intriguing: if I use the original variable task, I get a very big load of backslashes printed out
...

If I use a copy of the variable, it works as expected

Interesting. I guess there's a quirk related to how it modifies the variable in place, and the act of assignment changes some property to resolve that.

Unless there's some documented reason for it, you should check which implementation + version of Awk you're using and probably raise it as a bug.

edit: I was distracted earlier - it's because the gsub is occurring inside a loop, and each iteration adds/doubles backslashes. If used, it should be done prior to the loop.

Quote:

Originally Posted by ychaouche (Post 6384469)

Code:

gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",task);

This is handy. Should I turn into a function? I might use it in future scripts.

Yep - it was translated from one of the functions in a regex library for a different language.

I was a little surprised it wasn't already a built-in function.
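
A possible shape for such a function, as a sketch only: the name regex_escape is made up, and the gsub call is the one from this thread, which is assumed to behave as described under gawk.

```shell
result=$(awk '
# Hypothetical helper: backslash-escape regex metacharacters in s.
function regex_escape(s) {
    gsub(/[$^*()+\[\]{}.?\\|]/, "\\\\&", s)
    return s
}
BEGIN {
    task = "[16.1.1] learning to use getopts"
    pattern = regex_escape(task)      # escape once, outside any loop
    print pattern

    # The escaped text can now sit inside a larger pattern, e.g. anchored.
    line = "[16.1.1] learning to use getopts"
    verdict = (line ~ ("^" pattern "$")) ? "match" : "no match"
    print verdict
}')
printf '%s\n' "$result"
```

Because awk passes scalars to functions by value, the caller's task variable is left untouched, so calling the function repeatedly does not pile up backslashes the way an in-place gsub inside a loop does.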

MadeInGermany

10-05-2022 01:32 PM

Oh my dear :eek:
Go for the index() function!

boughtonp

10-05-2022 05:17 PM

Quote:

Originally Posted by MadeInGermany (Post 6384508)

Go for the index() function!

Yep, the benefit of escaping is for use inside a larger pattern; when not doing that, the index function is simpler, clearer, more efficient, etc.

Also, I wasn't paying attention earlier - the excess slashes are due to the replace being performed inside a loop; if used it needs to only be done once (hence why resetting the variable immediately prior hid the issue).
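
Putting the pieces from this thread together, an end-to-end sketch for the sample in the first post might look like the following. It assumes gawk (regex RS), selects records by their heading rather than by record number, and uses a made-up sample file; it is not the script the original poster ended up with.

```shell
# Hypothetical sample file, copied from the structure in the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[...]

* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
on several lines
** [2] task baz
some description
on several lines
** [2.1.c] task bar
some description
on several lines

* other sections
 ...
EOF

description=$(awk -v RS='\n\\* ' -v FS='\n\\*\\* ' '
    # In the "tail" record, the task is the last non-empty line of the last field.
    $1 == "tail" {
        n = split($NF, lines, "\n")
        while (n > 1 && lines[n] == "") n--
        task = lines[n]
    }
    # In the "tasks" record, print the field that starts with that task.
    $1 == "tasks" && task != "" {
        for (i = NF; i > 1; i--) {
            if (index($i, task) == 1) {
                d = $i
                sub(/\n+$/, "", d)    # drop trailing newlines
                print "** " d
                break
            }
        }
    }
' "$sample")
printf '%s\n' "$description"
rm -f "$sample"
```

Using index() keeps the comparison literal, so the brackets and dots in the task name need no escaping.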
