[SOLVED] Bash script - too many arguments

computersavvy · 07-12-2021, 02:39 PM

Quote:

Originally Posted by MadeInGermany

Have a loop and a "found" variable

Code:

wordfile=wordcount_file
txtfound=
for i in *.txt
do
  [ -f "$i" ] || continue
  wc -w "$i"
  txtfound=1
done > $wordfile
if [ -z "$txtfound" ]
then
  for filename in *.*
  do
    ext=${filename##*.}
    case $ext in
    docx)
      docx2txt "$filename"
    ;;
    odt)
      odt2txt "$filename" --output="$filename".txt
    ;;
    pdf)
      pdf2txt -o "$i".txt "$filename"
    esac
  done
  for i in *.txt
  do
    [ -f "$i" ] || continue
    wc -w "$i"
  done > $wordfile
fi

Redirecting the whole loop overwrites the wordcount file ( >> would append instead).

Ummm, this has the same problem as the original posted by the OP. The third for loop counts the original *.txt files that were already counted in the first loop, so it does double counting.

It also seems to only enter the second for loop if there were no text files found and counted in the first for loop. My understanding is that there may be both text files and other documents so the OP wants to count both types.
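
A minimal sketch of what a redirection attached to "done" does, as in the "done > $wordfile" lines of the quoted script. The temporary directory and the two log file names are made up for illustration; only the > versus >> behavior is the point.

```bash
# Illustrative only: tmpdir and the two log names are not from the thread.
tmpdir=$(mktemp -d)
overwrite_file="$tmpdir/overwrite.log"
append_file="$tmpdir/append.log"

# Run the same loop twice, once per redirection style.
for run in first second
do
    # ">" on "done": the file is opened and truncated once, before the loop starts
    for n in 1 2 3
    do
        echo "run $run, line $n"
    done > "$overwrite_file"

    # ">>" on "done": output is added after whatever the file already holds
    for n in 1 2 3
    do
        echo "run $run, line $n"
    done >> "$append_file"
done

# $(( ... )) strips the padding some wc implementations add
overwrite_lines=$(( $(wc -l < "$overwrite_file") ))
append_lines=$(( $(wc -l < "$append_file") ))
echo "overwrite: $overwrite_lines lines, append: $append_lines lines"
```

After two runs the overwritten file holds only the three lines of the second run, while the appended file holds all six.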

computersavvy · 07-12-2021, 02:50 PM

I rewrote my proposed script, made only one for loop, and simplified the processing. It also writes both the file name processed and the count out. If the filenames are not necessary simply remove the echo statements.

Code:

#!/usr/bin/bash

wordfile=wordcount_file
# Start from an empty output file (the redirection alone truncates it)
: > "$wordfile"

for filename in *.* 
do
    ext=${filename##*.}
    case "$ext" in
        docx)   
            echo "$filename" >> $wordfile
            docx2txt "$filename" | wc -w >> $wordfile
        ;;
        odt) 
            echo "$filename" >> $wordfile
            odt2txt "$filename"  | wc -w >> $wordfile
        ;;
        pdf) 
            echo "$filename" >> $wordfile
            pdf2txt "$filename" | wc -w >> $wordfile
        ;;
        txt)
            echo "$filename" >> $wordfile
            wc -w < "$filename" >> $wordfile
        ;;
        *)
            continue
        ;;
    esac
 done

I tested it with txt, odt, and pdf files. Note that $filename is enclosed in quotes, as this allows it to process even filenames that contain spaces.

There are no extra .txt files created, simply counting the words in the existing docs.
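
A minimal sketch of why the quotes around $filename matter, assuming a made-up file name with spaces in a temporary directory. Without the quotes the [ command receives the name as several separate words and fails with an error along the lines of the one in the thread title.

```bash
# Illustrative only: the directory and file name are made up.
tmpdir=$(mktemp -d)
filename="$tmpdir/my old notes.txt"
printf 'one two three\n' > "$filename"

# Unquoted: the value is split at the spaces, so [ receives several words
# after -f and reports an error (hidden here by 2>/dev/null).
if [ -f $filename ] 2>/dev/null
then
    unquoted=ok
else
    unquoted=failed
fi

# Quoted: the whole name reaches [ as one word.
if [ -f "$filename" ]
then
    quoted=ok
else
    quoted=failed
fi

words=$(( $(wc -w < "$filename") ))
echo "unquoted test: $unquoted, quoted test: $quoted, words: $words"
```

The same splitting applies to any command that receives an unquoted variable, not only to [.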

igadoter · 07-12-2021, 06:13 PM

I think you need the find utility to perform some action on the .txt files. Globbing inside a script may yield strange behavior. The star * in a quoted find pattern is matched by find itself, not expanded by the shell into file names. More or less. Say

Code:

$ find ./ -name '*.txt' -exec foo '{}' \;

foo is a custom script that performs the action on each file found. Just read the manual for find; there are many useful options. Just don't get accustomed to writing poorly designed scripts.
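
A minimal sketch of the find approach, with wc -w standing in for the hypothetical foo script. The temporary directory and file names are made up; note that find also descends into subdirectories, which the plain *.txt glob in the earlier scripts does not.

```bash
# Illustrative only: tmpdir and the file names are made up.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
printf 'one two three\n' > "$tmpdir/a.txt"
printf 'four five\n' > "$tmpdir/sub/b c.txt"
printf 'ignored\n' > "$tmpdir/notes.md"

# The quotes keep the shell from expanding *.txt; find does the matching itself,
# descends into sub/, and passes each name to wc as a single argument.
find "$tmpdir" -name '*.txt' -exec wc -w '{}' \; > "$tmpdir/counts"

# Add up the first column (the per-file word counts).
total=$(awk '{ sum += $1 } END { print sum }' "$tmpdir/counts")
echo "total words in .txt files: $total"
```

The file name containing a space needs no special handling here, because find hands each path to the command as one argument.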

Edit: I think you don't need any case statement. The conversion programs should detect the file format, so this should work

Code:

$ pdf2txt || docx2txt || odt2txt

The order depends on which kinds of files are more frequent.

MadeInGermany · 07-13-2021, 04:34 AM

Quote:

Originally Posted by salmanahmed

Please see the red-quoted text. In the beginning the variable "txtfound" is mentioned without any value, then later the value "1" is given to it. I am not sure, but does "1" here mean that the file is present?
How does "txtfound" work here?
Thanks

Yes, a 1 (non-empty) value means that a .txt file was found.
[ -z "$txtfound" ]
is true if the variable is empty (zero length).
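
A minimal sketch of the flag pattern, assuming the default shell settings (nullglob off). The function name any_txt and the two temporary directories are made up for illustration and are not part of the script in the thread.

```bash
# Illustrative only: any_txt is a hypothetical helper, not part of the thread's script.
any_txt() {
    txtfound=
    for i in "$1"/*.txt
    do
        # With no match, the pattern stays as the literal string "*.txt",
        # which is not a file, so this skips it.
        [ -f "$i" ] || continue
        txtfound=1
    done
}

empty_dir=$(mktemp -d)
full_dir=$(mktemp -d)
printf 'hello world\n' > "$full_dir/a.txt"

any_txt "$empty_dir"
if [ -z "$txtfound" ]; then empty_result="no .txt found"; else empty_result="found"; fi

any_txt "$full_dir"
if [ -z "$txtfound" ]; then full_result="no .txt found"; else full_result="found"; fi

echo "empty dir: $empty_result"
echo "full dir: $full_result"
```

The flag stays empty unless the loop body reaches the assignment, which is why the -z test can tell the two directories apart.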

A correction:

Code:

    pdf2txt -o "$filename".txt "$filename"

I kept the intention in post #1, perhaps it needs a correction as well.

salmanahmed · 07-13-2021, 12:00 PM

Quote:

Originally Posted by igadoter

I think you need the find utility to perform some action on the .txt files. Globbing inside a script may yield strange behavior. The star * in a quoted find pattern is matched by find itself, not expanded by the shell into file names. More or less. Say

Code:

$ find ./ -name '*.txt' -exec foo '{}' \;

foo is a custom script that performs the action on each file found. Just read the manual for find; there are many useful options. Just don't get accustomed to writing poorly designed scripts.

Edit: I think you don't need any case statement. The conversion programs should detect the file format, so this should work

Code:

$ pdf2txt || docx2txt || odt2txt

The order depends on which kinds of files are more frequent.

Maybe 'find' will also work in this situation (I am not sure), but just on a lighter note, I will reply by quoting a lyric from Daft Punk's song "Get Lucky":

Quote:

we've come too far to give up who we are

As a newbie in bash scripting, I put so much effort into this script that even the thought of rewriting it makes me tired. I will definitely rest for a few days after completing this.

salmanahmed · 07-13-2021, 12:05 PM

Quote:

Originally Posted by computersavvy

I rewrote my proposed script, made only one for loop, and simplified the processing. It also writes both the file name processed and the count out. If the filenames are not necessary simply remove the echo statements.

Code:

#!/usr/bin/bash

wordfile=wordcount_file
# Start from an empty output file (the redirection alone truncates it)
: > "$wordfile"

for filename in *.* 
do
    ext=${filename##*.}
    case "$ext" in
        docx)   
            echo "$filename" >> $wordfile
            docx2txt "$filename" | wc -w >> $wordfile
        ;;
        odt) 
            echo "$filename" >> $wordfile
            odt2txt "$filename"  | wc -w >> $wordfile
        ;;
        pdf) 
            echo "$filename" >> $wordfile
            pdf2txt "$filename" | wc -w >> $wordfile
        ;;
        txt)
            echo "$filename" >> $wordfile
            wc -w < "$filename" >> $wordfile
        ;;
        *)
            continue
        ;;
    esac
 done

I tested it with txt, odt, and pdf files. Note that $filename is enclosed in quotes, as this allows it to process even filenames that contain spaces.

There are no extra .txt files created, simply counting the words in the existing docs.

No, it's not calculating the word count of all the files. The suggestions made by "MadeInGermany" worked perfectly. However, I must say that you also helped me a lot. I really appreciate that you spared some of your precious time and looked into my problem.
Thanks a lot, buddy

salmanahmed · 07-13-2021, 12:07 PM

Quote:

Originally Posted by MadeInGermany

Yes, a 1 (non-empty) value means that a .txt file was found.
[ -z "$txtfound" ]
is true if the variable is empty (zero length).

A correction:

Code:

    pdf2txt -o "$filename".txt "$filename"

I kept the intention in post #1, perhaps it needs a correction as well.

Initially I was confused, but then "man test" helped me with the "-z" option. After that everything was clear. Your suggestions solved my problem. Thanks a lot for sparing your precious time for me.

salmanahmed · 07-13-2021, 12:08 PM

One last thing before closing the topic. Can you please recommend some good books on bash programming for the following levels:
1. Beginner level
2. Intermediate level
3. Advanced level

Thanks

computersavvy · 07-13-2021, 12:22 PM

This and this are very good tutorials, among many others found with a simple online search for "bash tutorial" or similar.

salmanahmed · 07-13-2021, 12:55 PM

Quote:

Originally Posted by computersavvy

This and this are very good tutorials, among many others found with a simple online search for "bash tutorial" or similar.

Thanks a lot

boughtonp · 07-13-2021, 05:45 PM

Quote:

Originally Posted by salmanahmed

Maybe 'find' will also work in this situation (I am not sure), but just on a lighter note, I will reply by quoting a lyric from Daft Punk's song "Get Lucky":

Quote:

we've come too far to give up who we are

...

That's not always a good approach (and not quite what they are advocating), so maybe you should take your inspiration from track twelve instead.

Anyway, not a book/tutorial, but ShellCheck is a really useful tool which can highlight (some) bugs and warn against potential issues.

salmanahmed · 07-14-2021, 07:00 AM

Quote:

Originally Posted by boughtonp

Anyway, not a book/tutorial, but ShellCheck is a really useful tool which can highlight (some) bugs and warn against potential issues.

Great utility. Thanks a lot

dugan · 07-14-2021, 09:37 AM

This is one of the newer BASH books. It has a good reputation.

https://linuxcommand.org/tlcl.php

igadoter · 07-14-2021, 11:10 AM

Ok what about

Code:

DOCX=(*.docx)
ODT=(*.odt)
PDF=(*.pdf)

# correct conversion command format so they produce files with .txt suffix
for i in ${DOCX[@]} ; do docx2txt "$i" ; done 
for i in ${ODT[@]} ; do odt2txt "$i" ; done
for i in ${PDF[@]} ; do pdf2txt  "$i"; done

# the last
wc *.txt > total_word_count

Say

Code:

$ wc -w *.info > /tmp/total_word_info
$ cat /tmp/total_word_info 
  11 Jinja2.info
  11 MarkupSafe.info
  21 Sphinx.info
  11 alabaster.info
  11 imagesize.info
  11 mando.info
  15 python3-babel.info
  11 pytz.info
  11 snowballstemmer.info
  11 sphinxcontrib-applehelp.info
  11 sphinxcontrib-devhelp.info
  11 sphinxcontrib-htmlhelp.info
  11 sphinxcontrib-jsmath.info
  11 sphinxcontrib-qthelp.info
  11 sphinxcontrib-serializinghtml.info
 179 total

So you really don't want to go back? At least run what I posted.
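
A minimal sketch of the single wc call over several files, assuming two made-up text files in a temporary directory. It relies on wc printing one line per file followed by a "total" line when it is given more than one file.

```bash
# Illustrative only: tmpdir and the file names are made up.
tmpdir=$(mktemp -d)
printf 'one two three\n' > "$tmpdir/a.txt"
printf 'four five\n' > "$tmpdir/b.txt"

# One invocation with several files: one line per file, then a "total" line.
wc -w "$tmpdir"/*.txt > "$tmpdir/total_word_count"

# The grand total is the first field of the last line.
total=$(tail -n 1 "$tmpdir/total_word_count" | awk '{ print $1 }')
echo "total: $total"
```

With only one matching file wc prints no "total" line, so the last line is then that file's own count.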

MadeInGermany · 07-14-2021, 12:45 PM

You can and should quote the @ references in order to protect them from expansions (field splitting and filename generation).

Code:

DOCX=(*.docx)
for i in "${DOCX[@]}" ; do docx2txt "$i" ; done

Or directly feed the loop:

Code:

for i in *.docx; do docx2txt "$i" ; done

Field splitting: split at $IFS (normally whitespace).
Filename generation: expand wildcards like * with matching filenames.
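
A minimal sketch of the difference the quotes make, assuming bash, the default $IFS, and two made-up file names, one of which contains a space. It only counts loop iterations, so no conversion tool is needed.

```bash
# Illustrative only: the file names are made up. Requires bash (arrays).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch 'annual report.docx' 'notes.docx'

DOCX=(*.docx)

# Unquoted: each element undergoes field splitting, so "annual report.docx"
# becomes two words.
unquoted=0
for i in ${DOCX[@]}; do unquoted=$((unquoted + 1)); done

# Quoted: each element stays one word.
quoted=0
for i in "${DOCX[@]}"; do quoted=$((quoted + 1)); done

echo "elements: ${#DOCX[@]}, unquoted words: $unquoted, quoted words: $quoted"
```

The array holds two elements, but the unquoted loop runs three times because the name with the space is split in two.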