BASH array / sed questions

ParaDoX667 · 04-02-2007, 01:59 AM

Hi friendly people of linuxquestions...

I'm trying at present to write a bash script that will read a list of files with a certain extension into an array, and then pass the contents of that array to a sed command.

The aim is to read in the filename of each original file, run the file through a range of processes (SED / GAWK, for data extraction), and use the original filename (from the array) for naming throughout the entire process.

EG:
ATV_1234_a_wm.wag (original filename)
ATV_1234_a_wm.sed1 (after first sed process)
ATV_1234_a_wm.gwk1 (after first gawk process)
ATV_1234_a_wm-final.wag (after all files completed)
(after all the processing the sed1 / gwk1 files are removed leaving me with only the original + final.wag files)

The reason behind this is that I have thousands of files to extract specific data from, and they are all named with the same convention (only the 1234 part changes). I would like to automate my entire process so that all I need to do is run my script, select the data to be extracted (this part is already working), and let it process each file and output the required additional file.

Is there a way to
a) easily read the filenames into an array
b) pass the array contents to sed / gawk to be used as filenames

Any help appreciated so I don't spend the next 6 months tearing my hair out.

Many Many Many thanks for anyone that can help.
Cheers!

omnio · 04-02-2007, 02:35 AM

Maybe I miss the point here, but why is it necessary to use arrays and why should sed & awk be run again on the array? Why not use something very simple, like:

Code:

#!/bin/bash
# myscript

fnc() {
    cd "$1" || exit 1
    for file in * ; do
        shortfile="${file%%.*}"

        ... sed the "${file}" file and output to "${shortfile}.sed1"
        ... awk the "${shortfile}.sed1" file and output to "${shortfile}.gwk1"

        rm "${shortfile}.sed1"
        mv "${shortfile}.gwk1" "${shortfile}-final.wag"
    done
}

fnc "$1"

And launch it like:

Code:

./myscript some-directory
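
For reference, here is a filled-in sketch of the loop above. The thread never states the real sed and awk commands, so the two used here are made-up stand-ins, and process_dir is a hypothetical name. It also restricts the loop to *.wag files and skips earlier "-final.wag" results, which is an assumption about what is wanted.

```bash
#!/bin/bash
# Sketch only. The sed and awk commands are made-up stand-ins;
# process_dir is a hypothetical name.

process_dir() (
    # The ( ... ) body runs in a subshell, so the cd does not leak to the caller.
    cd "$1" || exit 1
    for file in *.wag; do
        [ -e "$file" ] || continue                     # pattern matched nothing
        case $file in *-final.wag) continue ;; esac    # skip results of earlier runs
        base="${file%.wag}"

        # Placeholder step 1: strip leading whitespace.
        sed 's/^[[:space:]]*//' "$file" > "${base}.sed1"
        # Placeholder step 2: print the first field of each non-empty line.
        awk 'NF { print $1 }' "${base}.sed1" > "${base}.gwk1"

        rm "${base}.sed1"
        mv "${base}.gwk1" "${base}-final.wag"
    done
)
```

It would be called as: process_dir some-directory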

ParaDoX667 · 04-02-2007, 03:15 AM

Omnio - thanks for the reply.

I'll give that a try and see what happens.
If I can read the filename in, use that to output a new file without an array I will be very happy.

(as long as the processing through SED / GAWK doesn't fail)

Cheers.

pixellany · 04-02-2007, 03:19 AM

Quote:

I'm trying at present to write a bash script that will read a list of files with a certain extension into an array, and then pass the contents of that array to a sed command.

SED does not read arrays, it reads lines. To have SED operate on a list of file names, they would have to be in a file (or a stream).

For example:
Suppose you have a directory "stuff" with your files in it.
ls stuff >namelist
puts all the file names into a new file "namelist".
Then (e.g.)
cat namelist|sed s/1234/5678/g >newnamelist
would replace all the "1234" strings with "5678".

To use namelist to tell SED which files' contents to operate on, you could use AWK to get each filename from namelist and pass it to SED (directly or through cat) for processing. (AWK would be told to use the newline as the field separator.)

ParaDoX667 · 04-02-2007, 05:25 PM

Thanks pixellany,

I think I'm going to give up on batch processing with this script; it's driving me nuts.

I thought it would be relatively easy to generate a file list of original filenames, pass each of those (one at a time) to the appropriate SED / Gawk commands extracting data for me, use the original filename to output a new file and then start on the next file in the array.

I guess I was too ambitious ....

*Cheers for the help all*

jschiwal · 04-02-2007, 05:56 PM

You could use a for loop to read in the filenames.

Code:

for file in $(cat filelist); do
  sed '<sed-command>' "${file}" >"${file}".sed1
  awk '<awk-command>' "${file}.sed1" >"${file}.gwk1"
  ...
done

If you are using awk, you may be able to have awk do the same thing sed did. Also, since the input of one comes from the output of the other, you can use a pipe (as already suggested), which eliminates the need for an intermediate file.
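
As an illustration of folding the sed step into awk, here is a small sketch. The substitution (ERROR to error) and the "print the first field" step are invented for the example, as are the function names; the point is only that awk's gsub() can do what a simple sed s///g does.

```bash
#!/bin/bash
# Sketch only: the patterns and function names are invented for illustration.

# sed does the substitution, awk picks the first field.
two_step() {
    sed 's/ERROR/error/g' "$1" | awk '{ print $1 }'
}

# awk alone: gsub() replaces every match in the line, then print the first field.
one_step() {
    awk '{ gsub(/ERROR/, "error"); print $1 }' "$1"
}
```

Both functions take a filename and should print the same thing for the same input.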

Since the input files follow a strict pattern ("ATV_1234_a_wm.wag" for the original filename), using wildcards is easy and you don't need a list. However, if the "-final.wag" file is kept in the same directory, you might want to test for its existence.

Code:

for file in ATV_[[:digit:]][[:digit:]][[:digit:]][[:digit:]]_a_wm.wag; do
  if [ -f "${file%.wag}-final.wag" ]; then continue; fi
  ...  # processing instructions
done

Use with care. Untested.

In effect, using the filename patterns, you are creating a list of the files you need to process without needing to create a filelist in the first place. This is one less manual process, which hopefully will make your life easier and eliminate one potential source of error due to missing items or typos.
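
To see which files such a pattern would pick up before running any real processing, a dry-run helper along these lines might be useful. pending_files is a hypothetical name, and it assumes the files sit in the current directory.

```bash
#!/bin/bash
# Sketch only: pending_files is a hypothetical helper, run from the data directory.

pending_files() {
    for file in ATV_[[:digit:]][[:digit:]][[:digit:]][[:digit:]]_a_wm.wag; do
        [ -e "$file" ] || continue                      # pattern matched nothing
        [ -f "${file%.wag}-final.wag" ] && continue     # already processed
        printf '%s\n' "$file"
    done
}
```

Note that the pattern itself never matches the "-final.wag" files, because it requires ".wag" directly after "_a_wm".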

Lastly, I wanted to add something to watch for. If you do have a variable or array containing the number part of the filenames, be sure to treat it as a string and use double quotes around the variable when using it. If it passes through arithmetic evaluation instead, leading zeros will be dropped (or the value will be read as octal).
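
A small sketch of where leading zeros actually get lost in bash, using a made-up number part "0042". Kept as a string, the zeros survive; in arithmetic, a leading 0 means octal unless base 10 is forced with the 10# prefix.

```bash
#!/bin/bash
# Sketch only: "0042" is a made-up example value.

num="0042"

as_string="ATV_${num}_a_wm.wag"   # string use: zeros are kept
as_decimal=$((10#$num))           # arithmetic, base 10 forced: 42
as_octal=$((num))                 # arithmetic, leading 0 read as octal: 34

printf '%s\n' "$as_string" "$as_decimal" "$as_octal"
```

A value such as "0089" would make the plain $((num)) form fail outright, since 8 and 9 are not octal digits.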

ParaDoX667 · 04-02-2007, 07:13 PM

jschiwal warm pizza and beer for you:

IT ALL WORKS GREAT!

THANK YOU THANK YOU THANK YOU!

cfaj · 04-03-2007, 04:54 PM

Quote:

Originally Posted by jschiwal

You could use a for loop to read in the filenames.

Code:

for file in $(cat filelist); do

That is not a safe way to read a file. It will break if any filenames contain spaces or other pathological characters.

Quote:

Code:

  sed '<sed-command>' "${file}" >"${file}".sed1
  awk '<awk-command>' "${file}.sed1" >"${file}.gwk1"
  ...
done

Why bother with intermediate files? Why not pipe the output of sed directly into awk?

Code:

while IFS= read -r file
do
   sed '<sed-command>' "$file" |
    awk '<awk-command>' |
     whatever > "${file%.wag}-final.wag"
done < filelist
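
A runnable sketch of the same read loop, to show that names with spaces survive it. The sed and awk commands are invented placeholders, process_list is a hypothetical name, and the list file is assumed to hold one filename per line, relative to the current directory.

```bash
#!/bin/bash
# Sketch only: placeholder sed/awk commands, hypothetical function name.

process_list() {
    # $1: a file containing one filename per line
    while IFS= read -r file; do
        [ -n "$file" ] || continue            # ignore empty lines in the list
        sed 's/^[[:space:]]*//' "$file" |
            awk 'NF { print $1 }' > "${file%.wag}-final.wag"
    done < "$1"
}
```

IFS= stops read from trimming leading and trailing blanks, and -r stops it from treating backslashes specially. A filename containing a newline would still break a line-based list.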

omnio · 04-04-2007, 05:12 AM

Quote:

Originally Posted by cfaj

That is not a safe way to read a file. It will break if any filenames contain spaces or other pathological characters.

I have this problem whenever I try to assign filenames to arrays. Do you know of any character which is generally not accepted in filenames and which I can switch the $IFS to? (unfortunately ":" is accepted).
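
On the array question, two sketches that avoid changing $IFS at all. On Linux the only bytes a filename cannot contain are "/" and NUL, and a bash variable cannot hold NUL, so there is no safe character to put in $IFS. The function names here are hypothetical, and -maxdepth and -print0 are find extensions (present in GNU find) rather than POSIX options.

```bash
#!/bin/bash
# Sketch only: list_by_glob and list_by_find are hypothetical names.
# Both fill the global array "files".

# Variant 1: let the shell glob straight into the array; no word splitting happens.
list_by_glob() {
    files=( "$1"/*.wag )
    # Without nullglob an unmatched pattern stays literal; drop it.
    [ -e "${files[0]}" ] || files=()
}

# Variant 2: read NUL-delimited names from find.
list_by_find() {
    local f
    files=()
    while IFS= read -r -d '' f; do
        files+=( "$f" )
    done < <(find "$1" -maxdepth 1 -type f -name '*.wag' -print0)
}
```

Either way, the array should then be expanded as "${files[@]}" (with the double quotes) so that each name stays one word.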