LinuxQuestions.org


AndrewJS 07-15-2011 01:12 PM

use Awk to isolate a specific directory level...
 
hello

I have used Awk in the past to isolate the file name from a given path. That is to say, I may have a list of files contained in list.txt:

FIG. 1.

dir1/dir2/dir3/file1.dat
dir4/dir5/dir6/file2.dat
dir7/dir8/dir9/file3.dat
dir10/dir11/dir12/file4.dat
...and so on....

and I used the Awk command:

Code:

cat list.txt | awk -F "/" '{print $NF}'
to remove the prepended path name and so end up with list of the form:

FIG. 2.

file1.dat
file2.dat
file3.dat
file4.dat
..and so on...

I now want to do almost the exact opposite: instead of isolating the file name, I want to isolate, say, the middle directory in the list I have shown in Fig. 1. That is to say, I want to end up with output that reads:

Fig. 3.

dir2
dir5
dir8
dir11
...and so on...

Can someone please post the Awk command that would do this? (I assume it will be very similar in form to the Awk command I showed above.)
The point is, sometimes I may want to isolate the second directory, sometimes I may want to isolate the third directory or tenth or whatever - so I am hoping that if someone posts the Awk command to isolate the second level directory (to produce the output I showed in Fig.3) it should be fairly obvious by looking at the form of this command how to alter it and so isolate any other directory I want.

I hope I've been clear in what I'm asking!

opnsrc 07-15-2011 01:21 PM

Yes, very similar, replace $NF with $2.
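
A minimal sketch of that substitution, assuming list.txt holds the paths from Fig. 1. The shell variable n is a hypothetical name, added here only so the level can be changed in one place:

```shell
# Recreate the first four lines of the sample list from Fig. 1
printf '%s\n' \
    dir1/dir2/dir3/file1.dat \
    dir4/dir5/dir6/file2.dat \
    dir7/dir8/dir9/file3.dat \
    dir10/dir11/dir12/file4.dat > list.txt

# Second "/"-separated field of each line
awk -F/ '{print $2}' list.txt

# Same command with the level passed in as an awk variable (n is a made-up name)
n=3
awk -F/ -v n="$n" '{print $n}' list.txt
```

With that input the first command prints dir2, dir5, dir8 and dir11, one per line; with n=3 the second prints dir3, dir6, dir9 and dir12.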

Reuti 07-15-2011 01:21 PM

What about checking the man page of awk, section Fields?

MTK358 07-15-2011 01:50 PM

Quote:

Originally Posted by AndrewJS (Post 4415896)
and I used the Awk command:

Code:

cat list.txt | awk -F "/" '{print $NF}'

Awk is really unnecessary here. First, there's the basename command which is made just for this:

Code:

$ basename path/to/file
file

Also, it's possible to do it all in bash without using an external command:

Code:

path=path/to/file
echo "${path##*/}"

Quote:

Originally Posted by AndrewJS (Post 4415896)
Can someone please post the Awk command that would do this? (I assume it will be very similar in form to the Awk command I showed above.)
The point is, sometimes I may want to isolate the second directory, sometimes I may want to isolate the third directory or tenth or whatever - so I am hoping that if someone posts the Awk command to isolate the second level directory (to produce the output I showed in Fig.3) it should be fairly obvious by looking at the form of this command how to alter it and so isolate any other directory I want.

I hope I've been clear in what I'm asking!

If you understand that "$NF" gets the NFth field (NF being the number of fields on the line), then it should be really obvious.
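
To make that concrete: NF holds the number of fields on the current line, so $NF is the last field, and the parentheses in $(NF-1) let you count back from the end. A small sketch, using one path from Fig. 1 as input:

```shell
# NF is the field count, so $NF is the last field
echo 'dir1/dir2/dir3/file1.dat' | awk -F/ '{print NF}'        # number of fields
echo 'dir1/dir2/dir3/file1.dat' | awk -F/ '{print $NF}'       # last field (the file name)
echo 'dir1/dir2/dir3/file1.dat' | awk -F/ '{print $(NF-1)}'   # directory that holds the file
echo 'dir1/dir2/dir3/file1.dat' | awk -F/ '{print $2}'        # second field, counted from the left
```

For this line the four commands print 4, file1.dat, dir3 and dir2.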

grail 07-15-2011 02:11 PM

I would add that cat is a wasted command here as well .. Just pass the file name to awk.

PTrenholme 07-15-2011 02:54 PM

If, however, you want a list of the unique directory names, do something like this:

gawk -F'/' '{++directory[$3]} END {for (i in directory) {print i " (" directory[i] " files)"}}'

Here's what the output looks like:
Code:

$ ls -1 */*/*/* | gawk -F'/' '{++directory[$3]}; END {for (i in directory) {print i " (" directory[i] " files)"}}'
 (41792 files)
The Two Faces of Tomorrow (183 files)
Bennett, Nigel (1 files)
Screen Savers (18 files)
Harald (177 files)
4 1635-The Cannon Law (118 files)
Series - Belisarius (18 files)
...

I'm not sure what you're parsing. In my example, I was piping the file list from the ls command which is not a very efficient way to do this sort of thing. (An easier way would be find ./ -maxdepth 3 -mindepth 3 -type d, but you wouldn't get the count.)
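
For a plain list like the one in Fig. 1, the same counting idea can be sketched as follows. This assumes a file named list.txt with one relative path per line and counts on the second field. The sample lines are made up, and sort is added only because the order of awk's for-in loop is unspecified:

```shell
# Made-up sample input in the style of Fig. 1
printf '%s\n' \
    dir1/dir2/dir3/file1.dat \
    dir1/dir2/dir3/file2.dat \
    dir4/dir5/dir6/file3.dat > list.txt

# Count how many paths share each second-level directory
awk -F/ '{++count[$2]} END {for (d in count) print d " (" count[d] " files)"}' list.txt | sort
```

With this sample the output is "dir2 (2 files)" followed by "dir5 (1 files)".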

archtoad6 09-04-2011 01:37 AM

Quote:

Originally Posted by grail (Post 4415932)
I would add that cat is a wasted command here as well .. Just pass the file name to awk.

Unnecessary, yes; wasted, maybe not. I sometimes use cat this way to make the name of the file being processed stand out. IMO, this is a good programming style.

kurumi 09-04-2011 02:53 AM

Quote:

Originally Posted by archtoad6 (Post 4460891)
Unnecessary, yes; wasted, maybe not. I sometimes use cat this way to make the name of the file being processed stand out. IMO, this is a good programming style.

Making the name stand out using cat like that is less important than trying to reduce overheads, plus there is the annoyance that a loop fed by a pipe runs in a subshell, so variables it changes cannot be "seen" outside, e.g.:

Code:

var=0
cat file| while read line
do
  ((var++))
done
echo "var outside: $var"

var=0
while read line
do
  ((var++))
done < file
echo "var outside: $var"

test run:
Code:

$ bash test.sh
var outside: 0
var outside: 4

Thus, IMO, this is not a good shell scripting practice and should be avoided if possible.

grail 09-04-2011 06:46 AM

+1 to kurumi's post as my sentiments exactly.

unSpawn 09-04-2011 07:31 AM

Quote:

Originally Posted by kurumi (Post 4460934)
Making the name stand out using cat like that is less important than trying to reduce overheads

...which many will recognize as UUOC (aka "The Award For The Most Gratuitous Use Of The Word Cat In A Serious Shell Script").

David the H. 09-04-2011 09:33 AM

Bash v4.2 has introduced the lastpipe shell option, which makes the last command in a pipe chain run in the current environment, ksh-style. So the variable-scope problem can now be avoided, at least. However, I think it's still better to use bash's built-in file access instead of forking off a process for the external cat.
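
A minimal sketch of lastpipe, assuming bash 4.2 or later and a non-interactive script (the option only takes effect while job control is off, which is the default in scripts). The four input words are made up to mirror the four-line file in kurumi's test:

```shell
# bash only: run the last stage of a pipeline in the current shell
shopt -s lastpipe

var=0
printf '%s\n' one two three four | while read -r line
do
    var=$((var + 1))
done
echo "var outside: $var"
```

Run as a script, this should report 4 rather than the 0 seen in the first loop of kurumi's test.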

As for the OP's request, there are also several ways we can go about it inside bash.

The first and probably best is to use an array to separate the name into fields.
Code:

while IFS=/ read -r -a dirs; do          #IFS is changed for read only, not for the rest of the script
        echo "${dirs[1]}"                #gives you the second directory
done <file.txt

The second requires going through multiple steps of parameter expansion to extract the field you want.
Code:

while read dirname; do
        dirname2="${dirname#*/}"        #strip the first directory and its slash
        dirname2="${dirname2%%/*}"      #strip everything from the next slash onwards
        echo "$dirname2"                #gives you the second directory
done <file.txt

Finally, you can use a regular expression inside bash's [[ test to do the same.
Code:

re='([^/]+)/([^/]+)/([^/]+)/([^/]+)'
while read dirname; do
        [[ "$dirname" =~ $re ]]
        echo "${BASH_REMATCH[2]}"        #gives you the second directory
done <file.txt

I suppose you have to be careful how you construct the regex, though.

Reuti 09-04-2011 10:01 AM

Quote:

Originally Posted by unSpawn (Post 4461054)
...which many will recognize as UUOC (aka "The Award For The Most Gratuitous Use Of The Word Cat In A Serious Shell Script").

+1

If it’s important to have the name of the file in question at the beginning of the statement, I would suggest defining a function for it. Inside the function you can put the file at the end to feed the while loop, but in the function call it’s the argument.
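
A rough sketch of that idea, assuming the list is in list.txt. The function name second_dir and the sample lines are made up for illustration:

```shell
# Made-up sample input in the style of Fig. 1
printf '%s\n' dir1/dir2/dir3/file1.dat dir4/dir5/dir6/file2.dat > list.txt

# second_dir FILE: print the second directory of every path in FILE
second_dir() {
    # IFS=/ applies to this read only; "rest" soaks up the remaining fields
    while IFS=/ read -r first second rest
    do
        printf '%s\n' "$second"
    done < "$1"       # the redirection sits at the end, inside the function
}

second_dir list.txt   # at the call site the file is simply the argument
```

With the two sample lines this prints dir2 and dir5.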

grail 09-04-2011 10:06 AM

David, you forgot one of my favourite array-style options :)
Code:

while read -r dirs; do
    set -- ${dirs//\// }
    echo "$1"                #gives you the first directory
done <file.txt


David the H. 09-04-2011 10:35 AM

My bad. :(

Actually I don't like recommending the positional parameters, at least not without a warning for the newbies. Since set overwrites any previous values, you might mess up your script if they're already in use for other things.

Still, it does have the benefit of not needing to set IFS.
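
One way to limit that risk is to call set inside a function, because a function has its own positional parameters and the caller's are restored when it returns. A sketch, assuming bash (for the ${var//pattern/replacement} expansion) and paths without spaces or glob characters; first_dir is a hypothetical name:

```shell
first_dir() {
    local path=$1
    # Unquoted on purpose: word splitting turns "dir1 dir2 dir3 file1.dat" into $1 $2 $3 $4
    set -- ${path//\// }
    printf '%s\n' "$1"
}

set -- keep these args              # stand-in for a script's own arguments
first_dir dir1/dir2/dir3/file1.dat
echo "caller still has $# arguments: $*"
```

This prints dir1, and the caller's three arguments are still intact afterwards.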

BTW, the UUOC award text demonstrates how you can list the filename first without the use of cat. I don't know if it's any more readable, though.
Code:

<list.txt awk -F "/" '{print $NF}'
Redirections can be defined anywhere on the line, remember? ;)


All times are GMT -5.