LinuxQuestions.org

ychaouche 09-29-2022 11:02 AM

Extracting text at a dynamic location
 
Consider a file that is structured like this

Code:


[...]

* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
on several lines
** [2] task baz
some description
on several lines
** [2.1.c] task bar
some description
on several lines

* other sections
 ...

I'd like to write a script that:

Code:

1. Locates the last task listed in the * tail section.
  In this example, it is [2.1.c] task bar.

2. Displays the description of that task found in the * tasks section.
  In this example, it should display:

** [2.1.c] task bar
some description
on several lines


I'm curious whether there's a simple sed/awk solution to this. If not, I will turn to Python.
But simpler is better.

boughtonp 09-29-2022 12:32 PM


 
This sounds familiar... https://www.linuxquestions.org/questions/programming-9/awk-sed-bash-script-to-display-last-changelog-entry-4175716978

That's a very similar problem, so yes, Awk can do this too, and you should be able to apply what you've learned from that thread.

Here are some hints: Use "\n\* " as a record separator, "\n\*\* " as a field separator, and "$NF" to refer to the last field in a record.

Using that information have a go yourself and if you get stuck show your efforts.
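
As a rough illustration of those hints (not the poster's own script), the sketch below runs them against a made-up copy of the sample file from the first post. The sample() and last_field() helper names are hypothetical, and the sample is assumed to start directly with the "* tail" line. A multi-character RS is treated as a regex by gawk and mawk, but POSIX leaves that behaviour unspecified.

```shell
#!/bin/sh
# Hypothetical sample input mirroring the layout from the first post.
sample() {
    cat <<'EOF'
* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
on several lines
** [2] task baz
some description
on several lines
** [2.1.c] task bar
some description
on several lines

* other sections
EOF
}

# Print the last "**" field of record number $1, reading the file on stdin.
# RS splits records at "newline, star, space"; FS splits fields at
# "newline, star, star, space". The backslashes are doubled because awk
# processes escape sequences in -v assignments.
last_field() {
    awk -v RS='\n\\* ' -v FS='\n\\*\\* ' -v rec="$1" 'NR == rec { print $NF }'
}

sample | last_field 1
```

With this sample, record 1 is the tail section, and its last field is the whole "sub level 2" block: the heading line plus the task line under it. Picking out just the task line is a separate step.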


ychaouche 10-03-2022 05:30 AM

Thanks boughtonp! A little update: I need to access the last line of a field that has multiple lines,
like so:

Quote:

$ awk -v RS='\n\\* ' -v FS='\n\\*\\* ' 'NR==3 {print $(NF-3)}' ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
2022-09-19
09:44:58
[11.1] Plan

09:48:00
[15] override


09:50:45
[16] notes.search.recent

10:08:33
[16.1] refactoring de toutes les fonctions de recherche de notes

10:20:42
[16.1.1] learning to use getopts
$
In this example, I need to extract the last line, which is:

Quote:

[16.1.1] learning to use getopts

boughtonp 10-04-2022 12:09 PM


 
There are a few ways to access the last line of a variable; perhaps the simplest is to split on newlines and access the last element of the resulting array.

Awk's split works slightly differently from other languages: it fills the array passed as its second argument and returns the number of elements:
Code:

my_array_len = split(input_string,my_array,"\n");
print my_array[my_array_len];

Where "input_string" is replaced by whatever variable/expression contains the lines, and "my_array" and "my_array_len" are variables that can be named however you like.

(If there's a trailing newline in the input, the last array element is empty, so subtract 1 from the index to reach the last non-empty line.)
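
A minimal sketch of that split approach, including the trailing-newline adjustment, might look like this. The sample() and last_line() names are hypothetical, and the input is a shortened, made-up version of the file from the first post. Instead of a fixed -1, it steps back over any empty trailing elements, which is my variation rather than something stated in the thread.

```shell
#!/bin/sh
# Hypothetical sample: a shortened version of the file from the first post.
sample() {
    cat <<'EOF'
* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
EOF
}

# Print the last non-empty line of the last field of record number $1.
last_line() {
    awk -v RS='\n\\* ' -v FS='\n\\*\\* ' -v rec="$1" '
    NR == rec {
        # split() fills the array "lines" and returns how many elements it made.
        n = split($NF, lines, "\n")
        # A trailing newline leaves an empty final element; step back past it.
        while (n > 1 && lines[n] == "") n--
        print lines[n]
    }'
}

sample | last_line 1
```

Here the last field of record 1 ends with a newline, because the record separator only consumes the second newline of the blank line before "* tasks", so the adjustment is actually needed.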


MadeInGermany 10-04-2022 04:14 PM

Or pipe it to tail -1
Code:

awk ... | tail -1

ychaouche 10-05-2022 03:55 AM

Thank you boughtonp! That code will save me a gratuitous call to length(), as in my code:

Code:

18:25:20 ~ -1- $ awk  -v RS='\n\\* ' -v FS='\n\\*\\* ' 'NR==3 {fieldno=NF-3; split($fieldno,A,"\n"); print A[length(A)] }'  ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
[16.1.1] learning to use getopts
18:25:31 ~ -1- $

@MadeInGermany it's a shame to use another subprocess when you can do everything inside awk.

ychaouche 10-05-2022 06:42 AM

The code is now in its own file

Code:


#!/usr/bin/gawk -f
BEGIN {
    RS="\n\\* ";
    FS="\n\\*\\*";
}

NR==3 {
    fieldno=NF-3;
    l=split($fieldno,A,"\n");
    task=A[l];
    print("task is", task);
    printf("it was found on third record and %sth field\n", fieldno);
}

NR==4 {
    for(i = NF; i > 0; i--) {
        #if ($i ~ "learning to use getopts") {
        if ($i ~ task ) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}



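The problem is that the task variable contains the regex metacharacters "[" and "]", so the if ($i ~ task) condition is never met: "[16.1.1]" is read as a bracket expression matching a single character, not as literal text.
With the other test, if ($i ~ "learning to use getopts"), I get a match: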

Code:

12:19:53 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
expression found in 4th row, 43th field
 [16.1.1] learning to use getopts 

12:33:22 ~/CODE/TMP -2- $

But with the if ($i ~ task) code I get no match:

Code:

12:33:22 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
12:33:36 ~/CODE/TMP -2- $


boughtonp 10-05-2022 07:48 AM


 
To do a non-regex search, use index(haystack,needle): it returns the position of the first match (1 if the haystack starts with the needle) and 0 if there is no match.

But it can be useful to convert a string to a regex pattern, by adding backslashes where required:
Code:

gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",text);
Then the pattern can have additional regex metacharacters prefixed/suffixed as required (e.g. to ensure start/end of string, variable prefixes, etc.)
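
To make the difference concrete, here is a small sketch comparing the three tests on one string. The compare() helper and its output labels are hypothetical; the gsub call is the one quoted above, applied to a copy of the string.

```shell
#!/bin/sh
# Compare three ways of looking for the string $1 inside the text $2.
compare() {
    awk -v task="$1" -v text="$2" 'BEGIN {
        # Escape a copy, so the original string stays usable for index().
        rx = task
        gsub(/[$^*()+\[\]{}.?\\|]/, "\\\\&", rx)

        as_regex   = (text ~ task)      ? "match" : "no match"
        as_literal = index(text, task)  ? "match" : "no match"
        as_escaped = (text ~ rx)        ? "match" : "no match"

        print "unescaped regex: " as_regex
        print "index(): " as_literal
        print "escaped regex: " as_escaped
    }'
}

compare '[16.1.1] learning to use getopts' ' [16.1.1] learning to use getopts'
```

The unescaped test fails because "[16.1.1]" only matches one of the characters 1, 6 or ".", and the character before " learning" in the text is "]". Note that values passed with -v have backslash escapes processed by awk, so this sketch assumes the task text contains no backslashes.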


ychaouche 10-05-2022 09:18 AM

gsub didn't work:

Code:

NR==4 {
    for(i = NF; i > 0; i--) {
        #if ($i ~ "learning to use getopts") {
        gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",text);
        if ($i ~ task ) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

15:04:28 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
15:16:50 ~/CODE/TMP -2- $


but index did.

Code:


NR==4 {
    for(i = NF; i > 0; i--) {
        #if ($i ~ "learning to use getopts") {
        # gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",text);
        # if ($i ~ task) {
        if (index($i,task)) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}




15:16:50 ~/CODE/TMP -2- $ ./awk ~/NOTES/LOG/TASKS/nouvelle-vm-dns.flow
task is [16.1.1] learning to use getopts
it was found on third record and 15th field
expression found in 4th row, 43th field
 [16.1.1] learning to use getopts

15:17:26 ~/CODE/TMP -2- $


boughtonp 10-05-2022 09:30 AM


 
The "text" bit was intended to be generic - in your context you'd want something like:

Code:

...
taskrx = task
gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",taskrx);
if ($i ~ taskrx) {
...

But this approach is more for when you want to add to the pattern; if index is sufficient, that's the simpler and more efficient approach.


ychaouche 10-05-2022 10:59 AM

oops, you were right ^^', didn't check the name of the variable.

But there's something intriguing: if I use the original variable task, I get a very big load of backslashes printed out (see https://i.imgur.com/mgoKUtJ.png).

Code:

NR==4 {
    for(i = NF; i > 0; i--) {
        gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",task);
        if ($i ~ task) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}

If I use a copy of the variable, it works as expected:


Code:

NR==4 {
    for(i = NF; i > 0; i--) {
        taskc=task;
        gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",taskc);
        if ($i ~ taskc) {
            printf("expression found in %sth row, %sth field\n",NR,i);
            print $i;
        }
    }
}


ychaouche 10-05-2022 11:05 AM

Code:

gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",task);
This is handy. Should I turn it into a function? I might use it in future scripts.

boughtonp 10-05-2022 11:17 AM

Quote:

Originally Posted by ychaouche (Post 6384468)
But there's something intriguing: if I use the original variable task, I get a very big load of backslashes printed out
...

If I use a copy of the variable, it works as expected

Interesting. I guess there's a quirk related to how it modifies the variable in place, and the act of assignment changes some property to resolve that.

Unless there's some documented reason for it, you should check which implementation + version of Awk you're using and probably raise it as a bug.


edit: I was distracted earlier. It's because the gsub is occurring inside a loop: each iteration escapes the backslashes added by the previous one, so they multiply. If used, it should be done once, prior to the loop.
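
The effect is easy to reproduce in isolation. In this sketch, the hypothetical escape_n_times() helper applies the same gsub to a short string a given number of times, standing in for the loop iterations:

```shell
#!/bin/sh
# Apply the escaping gsub to string $1 repeatedly, $2 times, and print the result.
escape_n_times() {
    awk -v s="$1" -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++)
            # Backslash is itself in the bracket expression, so every pass
            # also escapes the backslashes that earlier passes inserted.
            gsub(/[$^*()+\[\]{}.?\\|]/, "\\\\&", s)
        print s
    }'
}

escape_n_times 'a.b' 1   # one backslash before the dot
escape_n_times 'a.b' 2   # three backslashes
escape_n_times 'a.b' 3   # seven backslashes
```

After the second pass the pattern no longer describes the original text at all, which is why only the copy made inside the loop (reset from the untouched original on each iteration) kept working.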


Quote:

Originally Posted by ychaouche (Post 6384469)
Code:

gsub(/[$^*()+\[\]{}.?\\|]/,"\\\\&",task);
This is handy. Should I turn it into a function? I might use it in future scripts.

Yep - it was translated from one of the functions in a regex library for a different language.

I was a little surprised it wasn't already a built-in function.
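
One possible shape for such a function is sketched below. The regex_escape() and find_task() names are hypothetical, not from any library. Because awk passes scalar arguments by value, the gsub inside the function works on a private copy, so the caller's variable is never modified in place. The example also shows the "larger pattern" use case by anchoring the escaped text to the whole line.

```shell
#!/bin/sh
# Print "line number: line" for every stdin line that is exactly the text $1.
find_task() {
    awk -v task="$1" '
    # Return s with regex metacharacters backslash-escaped.
    # s is a by-value copy, so the argument passed by the caller is untouched.
    function regex_escape(s) {
        gsub(/[$^*()+\[\]{}.?\\|]/, "\\\\&", s)
        return s
    }
    # Build the pattern once, before any input is read.
    BEGIN { rx = "^" regex_escape(task) "$" }
    $0 ~ rx { print NR ": " $0 }
    '
}

printf '%s\n' 'x [1] task foo' '[1] task foo' '[1] task foo bar' |
    find_task '[1] task foo'
```

Only the second input line is an exact match here; index() alone would also have accepted the other two, which is the kind of case where escaping earns its keep.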


MadeInGermany 10-05-2022 01:32 PM

Oh my dear :eek:
Go for the index() function!

boughtonp 10-05-2022 05:17 PM

Quote:

Originally Posted by MadeInGermany (Post 6384508)
Go for the index() function!

Yep, the benefit of escaping is for use inside a larger pattern; when not doing that, the index function is simpler, clearer, more efficient, etc.


Also, I wasn't paying attention earlier: the excess backslashes are due to the replace being performed inside a loop. If used, it needs to be done only once (which is why resetting the variable immediately beforehand hid the issue).
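
Pulling the pieces of the thread together, a sketch for the sample file from the first post could look like the following. It is not the poster's final script: the sample() and show_last_task() names are hypothetical, the sections are located by name rather than by record number, the "* tail" section is assumed to come before "* tasks", and the match is an exact comparison against the first line of each task block instead of index().

```shell
#!/bin/sh
# Hypothetical sample input mirroring the layout from the first post.
sample() {
    cat <<'EOF'
* tail
** sub level 1
[1] task foo
** sub level 2
[2.1.c] task bar

* tasks
** [1] task foo
some description
on several lines
** [2] task baz
some description
on several lines
** [2.1.c] task bar
some description
on several lines

* other sections
EOF
}

# Read the file on stdin; print the "* tasks" entry for the last task in "* tail".
show_last_task() {
    awk -v RS='\n\\* ' -v FS='\n\\*\\* ' '
    # "tail" section: remember the last non-empty line of its last field.
    # The optional "* " covers the case where the file starts with this section.
    $1 ~ /^(\* )?tail/ {
        n = split($NF, lines, "\n")
        while (n > 1 && lines[n] == "") n--
        task = lines[n]
    }
    # "tasks" section: print the block whose heading line equals that task.
    $1 ~ /^tasks/ && task != "" {
        for (i = 2; i <= NF; i++) {
            split($i, body, "\n")
            if (body[1] == task) {
                out = $i
                sub(/\n+$/, "", out)   # drop trailing newlines left by RS
                print "** " out
            }
        }
    }'
}

sample | show_last_task
```

Comparing the heading line for equality avoids both regex escaping and the prefix ambiguity of index() (for example "[1] task foo" against "[1] task foo bar"), at the cost of requiring the two sections to spell the task identically.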



All times are GMT -5.