Linux - Newbie
This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
Code:
wordfile=wordcount_file
txtfound=
for i in *.txt
do
  [ -f "$i" ] || continue
  wc -w "$i"
  txtfound=1
done > "$wordfile"
if [ -z "$txtfound" ]
then
  for filename in *.*
  do
    ext=${filename##*.}
    case $ext in
    docx)
      docx2txt "$filename"
    ;;
    odt)
      odt2txt "$filename" --output="$filename".txt
    ;;
    pdf)
      # was "$i".txt, a leftover from the first loop (the author corrects this later in the thread)
      pdf2txt -o "$filename".txt "$filename"
    ;;
    esac
  done
  for i in *.txt
  do
    [ -f "$i" ] || continue
    wc -w "$i"
  done > "$wordfile"
fi
Redirecting the output of the whole loop with > overwrites the wordcount file on every run (>> would append instead).
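To make the difference concrete, here is a minimal sketch. The function name write_counts and its two parameters are made up for illustration and are not part of the script above. Because the redirection sits on `done`, the output file is opened and truncated once for the whole loop, so running the function twice does not double the content.

```bash
# Hypothetical helper: count the words in every .txt file of directory $1
# and write one count per line to the file named in $2.
write_counts() {
    for f in "$1"/*.txt
    do
        [ -f "$f" ] || continue   # skip the literal pattern if nothing matched
        wc -w < "$f"              # reading from stdin prints the count without the name
    done > "$2"                   # opened (and truncated) once for the whole loop
}
```

With `done >> "$2"` instead, each run would add its lines to whatever the file already contained.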
Ummm, this has the same problem as the original posted by the OP. The third for loop counts the original *.txt files that were already counted in the first loop, so it counts them twice.
It also seems to enter the second for loop only if no text files were found and counted in the first for loop. My understanding is that there may be both text files and other documents, so the OP wants to count both types.
I rewrote my proposed script, reduced it to a single for loop, and simplified the processing. It also writes out both the name of each file processed and its count. If the filenames are not necessary, simply remove the echo statements.
Code:
#!/usr/bin/bash
wordfile=wordcount_file
# start every run with an empty word count file
: > "$wordfile"
for filename in *.*
do
  ext=${filename##*.}
  case "$ext" in
  docx)
    echo "$filename" >> "$wordfile"
    # "-" as output name sends the text to stdout (assumes the docx2txt.pl-style command line)
    docx2txt "$filename" - | wc -w >> "$wordfile"
  ;;
  odt)
    echo "$filename" >> "$wordfile"
    odt2txt "$filename" | wc -w >> "$wordfile"
  ;;
  pdf)
    echo "$filename" >> "$wordfile"
    pdf2txt "$filename" | wc -w >> "$wordfile"
  ;;
  txt)
    echo "$filename" >> "$wordfile"
    wc -w < "$filename" >> "$wordfile"
  ;;
  *)
    continue
  ;;
  esac
done
I tested it with txt, odt, and pdf files. Note that $filename is enclosed in quotes, as this allows it to process even filenames that contain spaces.
No extra .txt files are created; the script simply counts the words in the existing documents.
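The effect of the quotes is easy to check. In this small sketch the helper count_args and the sample file name are invented for the demonstration: unquoted, the shell splits the name at the space and passes two arguments; quoted, it passes one.

```bash
# Hypothetical helper: print how many arguments it received.
count_args() {
    echo "$#"
}

filename="my report.txt"
count_args $filename      # unquoted: split into "my" and "report.txt", prints 2
count_args "$filename"    # quoted: passed as a single argument, prints 1
```

A converter or wc called with the unquoted variable would therefore look for two files that do not exist.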
Last edited by computersavvy; 07-12-2021 at 02:52 PM.
I think you need the find utility to perform some action on the .txt files. Globbing inside a script may yield strange behavior. The star * in a find command is a pattern that find matches itself, not a shell glob of file names, more or less. Say
Code:
$ find ./ -name '*.txt' -exec foo '{}' \;
foo is a custom script that performs the action on each file found. Just read the manual for find; there are many useful options. Just don't get accustomed to writing poorly designed scripts.
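For the word-count task in this thread, such a find call could look like the sketch below. It assumes a find that supports -maxdepth (GNU and BSD find do; the option is not in POSIX), and the function name count_txt_words is made up.

```bash
# Hypothetical helper: word count for each .txt file directly inside directory $1.
count_txt_words() {
    # The quotes keep the shell from expanding *.txt; find does the matching itself.
    # -type f leaves directories out; "{} +" hands many files to a single wc call.
    find "$1" -maxdepth 1 -type f -name '*.txt' -exec wc -w {} +
}
```

Calling `count_txt_words .` prints one line per file. When more than one file matches, wc also adds a final total line.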
Edit: I think you don't need any case. The conversion programs should detect the file format themselves, so something like this should work
Code:
$ pdf2txt "$filename" || docx2txt "$filename" || odt2txt "$filename"
The order depends on which kind of file is more frequent.
Please see the red-quoted text. At the beginning the variable "txtfound" is declared without any value, and later the value "1" is assigned to it. I am not sure, but does "1" here mean that the file is present?
How does "txtfound" work here?
Thanks
Yes, a 1 (non-empty) value means that a .txt file was found.
[ -z "$txtfound" ]
is true if the variable is empty (zero length).
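The flag pattern can be tried in isolation. In this small sketch the function name has_txt is hypothetical; the variable starts out empty and is set only if the loop meets a real file.

```bash
# Hypothetical helper: succeed if directory $1 contains at least one .txt file.
has_txt() {
    found=                        # empty string: nothing found yet
    for f in "$1"/*.txt
    do
        [ -f "$f" ] || continue   # an unmatched pattern stays literal, so skip it
        found=1                   # any non-empty value would do
    done
    [ -n "$found" ]               # -n is the opposite of -z: true if non-empty
}
```

It can be used as `if has_txt . ; then echo "text files present" ; fi`.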
A correction:
Code:
pdf2txt -o "$filename".txt "$filename"
I kept the intention in post #1, perhaps it needs a correction as well.
Last edited by MadeInGermany; 07-13-2021 at 04:47 AM.
Maybe 'find' will also work in this situation (I am not sure), but on a lighter note, I will reply by quoting a lyric from Daft Punk's song "Get Lucky":
Quote:
we've come too far to give up who we are
As a newbie in bash scripting, I put so much effort into this script that even the thought of rewriting it makes me tired. I will definitely rest for a few days after completing this.
Last edited by salmanahmed; 07-13-2021 at 12:10 PM.
No, the single-loop script is not calculating the word count of all the files. The suggestions made by "MadeInGermany" worked perfectly. However, I must say that you also helped me a lot. I really appreciate that you spared some of your precious time and looked into my problem.
Thanks a lot, buddy
Initially I was confused, but then "man test" explained the "-z" option to me. After that everything was clear. Your suggestions solved my problem. Thanks a lot for sparing your precious time for me.
One last thing before closing the topic: can you please recommend some good books on bash programming for the following levels?
1. Beginner level
2. Intermediate level
3. Advanced level
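Code:
DOCX=(*.docx)
ODT=(*.odt)
PDF=(*.pdf)
# conversion commands adjusted so that each one produces a file with a .txt suffix;
# the -f test skips a pattern that matched nothing and was left as a literal string
for i in "${DOCX[@]}" ; do [ -f "$i" ] && docx2txt "$i" ; done
for i in "${ODT[@]}" ; do [ -f "$i" ] && odt2txt "$i" --output="$i".txt ; done
for i in "${PDF[@]}" ; do [ -f "$i" ] && pdf2txt -o "$i".txt "$i" ; done
# the last step: word count of every .txt file, originals and conversions alike
wc -w *.txt > total_word_count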
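When wc is given several files it prints one line per file plus a total line. If only the grand total is wanted, a sketch like the following might do; the function name total_words is hypothetical.

```bash
# Hypothetical helper: print only the grand total of words in the given files.
total_words() {
    # cat joins all files into one stream, so wc sees a single input
    # and prints a single number without any file names.
    cat -- "$@" | wc -w
}
```

It would be called as `total_words *.txt > total_word_count`. Without any arguments cat reads standard input, so the call only makes sense when at least one file matches.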