Bash Programming - creating hard links to duplicate files
So, first post then. I've never actually used Linux for programming anything, and I'm much more used to C, but we had a homework assignment that wanted us to get used to bash scripting. Our script will be passed a single input, the path of the directory we are to check. We need to compare all the regular files in that directory and, if any of them are duplicates, replace them with hard links, keeping the file that is lexicographically first (A before B, B before C, etc.). Other than the fact that we have to use cmp to compare two files and ln to create the link, we're pretty much given free rein. This is what I've sort of got right now:
Code:
!/bin/bash
## Creating a sorted array containing all the regular files in the directory we were passed
N=0
for i in $(find $1 -maxdepth 1 -type f | sort -u); do
    ARRAY[$N]="$i"
    let "N= $N + 1"
done
##compare each item in the array with the rest, replacing duplicates with hard links
total=${#ARRAY[@]}
for((i=0; i<=$(( $total-1)); i++))
do
    FIRST=${ARRAY[$i]}
    for((j=$i+1; j<=$(($total-2)); j++))
    do
        SECOND=${ARRAY[$j]}
        COMP=(cmp -s $FIRST $SECOND)
        if [ $COMP -eq 0 ]; then
            ln -f $FIRST $SECOND
        elif [ $COMP -eq >1 ]; then
            echo "Error reading file"
        fi
    done
done
I think I got the syntax wrong on a few things, but my main question is that, for the if statement there, am I allowed to pass the result of a command call? I know you can do things like compare variables, and the cmp command DOES return an exit status of 0 or 1, but am I actually allowed to do that?
Basically, I have a sorted array, so I don't have to worry about the lexicographical ordering.
Also, there's a warning that our script should be prepared to handle cases where the file names contain special characters such as spaces, "*", or a leading "-". However, on the command line itself, I haven't actually run into an issue. Is there something about those characters that could cause trouble when we run the script?
EDIT: It has occurred to me that this might have been better off in the Programming section, but I really am more or less a newbie at all of this. It confuses the heck out of me in most cases XD;;
To begin with, and as has been stated in several posts, please use code tags for your code. It preserves formatting and makes it much easier to read. That said, your very first mistake is:
Incorrect:
Code:
!/bin/bash
Correct:
Code:
#!/bin/bash
Quote:
am I allowed to pass the result of a command call?
You can pass the result of any command to a var using command substitution $(command):
Code:
x="$(echo "Hello") world"
echo $x
Hello world
The BASH variable "$?" always holds the exit status of the last command:
Code:
file1=/some/file
file2=/some/other/file
cmp "$file1" "$file2"
ret_val="$?" # ret_val now holds the exit status of the cmp command
if [ "$ret_val" = "0" ]; then
    echo "Files are the same"
elif [ "$ret_val" = "1" ]; then
    echo "Files are different"
else
    echo "Error reading files"
fi
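Since if branches on the exit status of whatever command follows it, the same check can also be sketched without [ ] or $? at all. The function name files_match and the temporary files below are made up for illustration:

```shell
# Hypothetical helper: succeeds (exit status 0) when the two files are identical.
files_match() {
    cmp -s "$1" "$2"
}

tmp=$(mktemp -d)                       # scratch directory for the demo
printf 'same\n'  > "$tmp/a"
printf 'same\n'  > "$tmp/b"
printf 'other\n' > "$tmp/c"

# 'if' runs the command and branches on its exit status directly.
if files_match "$tmp/a" "$tmp/b"; then
    echo "a and b are the same"
fi
if ! files_match "$tmp/a" "$tmp/c"; then
    echo "a and c differ (or could not be read)"
fi
```

Note that this two-way form does not tell "different" (status 1) apart from "error" (status greater than 1); when that matters, $? still has to be inspected.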
Quote:
Also, there's a warning that our script should be prepared to handle cases where the file names contain special characters such as spaces, "*", or a leading "-".
All BASH meta-characters appearing in file names must be escaped by preceding each meta-character with the escape character "\".
If the file's name is 'this file has spaces in the name':
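A minimal sketch of how such a name might be written on the command line, assuming a scratch directory made with mktemp -d (the directory and the file contents are not from the original post). Backslash-escaping each space and quoting the whole name both refer to the same file:

```shell
dir=$(mktemp -d)                       # scratch directory, assumed for the demo
echo "hello" > "$dir/this file has spaces in the name"

# Backslash-escape each space...
cat "$dir"/this\ file\ has\ spaces\ in\ the\ name
# ...or quote the whole name; both name the same file.
cat "$dir/this file has spaces in the name"
```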
Ah, sorry about the tags. I've put those in now (that is a lot neater XD). And thank you for the help. I think I've got it working. And since you've explained, I have a question about the escape clause too. If I want to apply the cmp command to a file that has spaces or special characters in its name, and I've stored that file name in an array, I'm assuming the escape characters aren't already written in for me? So when I grab that item from the array, I have to check for special characters and insert a \ before each one manually, correct?
Alright thanks, last question. This one's really silly and I FEEL silly for not being able to get it to do what I want.
For the elif statement, I want the code to simply do nothing. I know in C you can just put a ;. I tried leaving it blank, but that doesn't seem to work. Then I tried echo "" to see if I could get it to echo nothing, but unfortunately that prints a newline on every call to echo. I need to keep the case in, since if I just use else by itself, it will write out even in the cases I don't want (since I have three cases: 0, 1, or >1).
Darn, no. It gives me an error saying wait: pid 1 is not a child of this shell. I have to use the bash shell for the assignment, but I don't think that's it. I'm not quite sure about this error.
EDIT: Ah, I figured it out. It's "then :". That's odd syntax. I'm just not used to script language I guess XD
Ninja'd. But thank you so much for your help. I've finally got it more or less working. The special characters in the file name I can get rid of with sed, but the problem is if I call sed on the file I stored in the array, it edits the stuff inside the file and not the file name. I guess I have to store the name as a string variable and call sed on that variable first. But I've already isolated all the cases where I'll need to edit things with an escape clause, so I just need to figure out how to apply the regex to the file name and not the file contents.
Quote:
But I've already isolated all the cases where I'll need to edit things with an escape clause, so I just need to figure out how to apply the regex to the file name and not the file contents.
Use a pipe (|). BASH allows you to use the output of one command as the input of the next command.
Code:
echo "Hello World" | sed -n 's/World/Earthlings/p'
Hello Earthlings
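Applied to the question above, the same kind of pipe lets sed work on a name held in a variable rather than on the file's contents. A minimal sketch; the variable names and the sample file name are invented for illustration:

```shell
name='my file*.txt'                    # hypothetical file name held in a variable

# printf writes the name itself (not the file's contents) into the pipe;
# sed then puts a backslash in front of every space and asterisk.
escaped=$(printf '%s\n' "$name" | sed 's/[ *]/\\&/g')

printf '%s\n' "$escaped"               # prints: my\ file\*.txt
```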
The loop for i in $(find $1 -maxdepth 1 -type f | sort -u); do breaks on file names that include spaces (or tabs or newlines):
Code:
c@CW8:/tmp/tmp$ touch a 'a b'
c@CW8:/tmp/tmp$ for f in $( find -type f ); do echo ">$f<"; done
>./a<
>./a<
>b<
The robust solution is to use something like
Code:
while IFS='' read -r -d '' file
do
    files+=("$file")
done < <(find "$dir" -type f -print0)
Notes:
The -print0 makes find output file names separated by ASCII NUL characters rather than newlines.
The -d '' sets read's record separator to the empty string but bash is written in C which terminates strings with ASCII NUL so that's what read actually uses as the record separator (I think; anyway, it works).
The -r puts read in raw mode so it does not interpret any "backslash escapes". For example, any \t in the file name is kept as that, not translated to a tab.
The IFS='' sets bash's field separator to the empty string. By default it is space tab newline so would strip any of those characters from the front and back of the file name. EDIT: more properly the "path name".
files+=("$file") adds the file name to the array files. No need to use and increment an array index. The number of elements in an array is available as ${#array[*]}, leading to the standard idiom for iterating over an array:
Code:
for (( i=0; i<${#array[*]}; i++ ))
do
    # Do something with ${array[i]}
done
Note: this was originally posted as ${#array[ * ]}, with spaces, only to prevent the forum's list code seeing [*] as introducing the next list item. ${#array[*]} is the tidy form.
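Putting the notes above together, a sorted array of the regular files in one directory might be built as in the sketch below. It assumes GNU find and sort (for -print0 and -z); the function name list_regular_files, the array name files and the scratch directory are placeholders.

```shell
# Fill the global array 'files' with the regular files directly inside $1,
# sorted, without breaking on spaces, tabs, newlines or '*'.
list_regular_files() {
    local dir=$1 file
    files=()
    while IFS='' read -r -d '' file
    do
        files+=("$file")
    done < <(find "$dir" -maxdepth 1 -type f -print0 | sort -z)
}

demo=$(mktemp -d)                      # scratch directory for the demo
touch "$demo/b" "$demo/a b" "$demo/a"
list_regular_files "$demo"

# Iterate by index, as in the idiom above.
for (( i=0; i<${#files[@]}; i++ ))
do
    printf '%d: %s\n' "$i" "${files[i]}"
done
```

The sort -z stage keeps the NUL separators intact, so the names stay whole all the way into the array.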
cmp is "expensive". If you have a lot of files, it is better to test for identicality progressively, cheapest test first. stat --printf %s "$filename" (the file size) would be a good choice for the first test. md5sum "$filename" might be a better next/last test than cmp "$filename1" "$filename2". Both stat and md5sum have to be run on each of the two files, so they take a bit more coding than cmp; you may prefer to stay with cmp if you don't have many files. If you do decide on the others, the identicality test could be something like if [[ $(stat --printf %s "$filename1") -eq $(stat --printf %s "$filename2") ]]; then
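A minimal sketch of the cheapest-test-first idea, assuming GNU stat (for --printf) and a made-up function name same_content. The sizes are compared first and cmp runs only when they match:

```shell
# Succeeds only if both files have the same size AND the same bytes.
# The stat calls are cheap; cmp is reached only when the sizes agree.
same_content() {
    local size1 size2
    size1=$(stat --printf %s "$1") || return 2
    size2=$(stat --printf %s "$2") || return 2
    [[ $size1 -eq $size2 ]] && cmp -s "$1" "$2"
}

work=$(mktemp -d)                      # scratch directory for the demo
printf 'abc\n'  > "$work/one"
printf 'abc\n'  > "$work/two"
printf 'abcd\n' > "$work/three"

same_content "$work/one" "$work/two"   && echo "one and two match"
same_content "$work/one" "$work/three" || echo "one and three differ"
```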
In ARRAY[$N]="$i", the double quotes and the $ in front of N have no effect. Bash does not do "word splitting" on the expression to the right of an assignment =. Reference here. The expression inside an array index [ ] is an arithmetic expression. Bash substitutes the value of any variables named within an arithmetic expression without the need for the $ ("the value of") operator.
Similarly the string to the right of a let statement is an arithmetic expression. let is the original statement used to evaluate arithmetic expressions and is intuitively obvious. The later equivalent is (( <arithmetic expression> )) which is less obvious but has the advantage over let that it can be used as a test:
Code:
if (( <arithmetic expression> )); then
and can be substituted by its value using the $ operator: $(( <arithmetic expression> )).
Not only is the $ operator unnecessary in arithmetic expressions, whitespace either side of operators is optional so i=a+b can also be i = a + b according to taste.
EDIT: that is not true for arithmetic expressions used with let. They must either have no space around the operators (let i=a+b) or be in a string (let 'i = a + b').
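The points about arithmetic expressions can be checked with a few lines. A small sketch with arbitrary variable names:

```shell
a=2
b=3

let "i = a + b"        # let with a quoted string: spaces are allowed
let j=a+b              # let without quotes: no spaces around the operators
(( k = a + b ))        # (( )) form; no $ needed in front of a or b
m=$(( a + b ))         # $(( )) substitutes the value of the expression

if (( k == m )); then  # (( )) used directly as a test
    echo "i=$i j=$j k=$k m=$m"
fi
```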
This code ...
Code:
cmp "$file1" "$file2"
ret_val="$?" # ret_val now holds the exit status of the cmp command
if [ "$ret_val" = "0" ]; then
    echo "Files are the same"
elif [ "$ret_val" = "1" ]; then
    echo "Files are different"
else
    echo "Error reading files"
fi
... can be simplified to avoid the need for ret_val by
Code:
cmp "$file1" "$file2"
case $? in
    0 )
        echo "Files are the same"
        ;;
    1 )
        echo "Files are different"
        ;;
    * )
        echo "Error reading files"
esac
Quote:
If I wanted to apply the cmp command to a file that had spaces or special characters in its name, if I've stored that file name into an array, I'm assuming the escape characters aren't already written in for me then? So when I grab that item from the array, I have to check for special characters and insert a \ before each character manually, correct?
No. As long as you use double quotes around the variables, their values will be passed verbatim to cmp: cmp -s "$FIRST" "$SECOND"; COMP=$?
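A quick way to check this, sketched with invented file names in a scratch directory: the array elements are stored and passed on exactly as they are, with no backslashes added.

```shell
dir=$(mktemp -d)                       # scratch directory for the demo
printf 'x\n' > "$dir/a b"              # name with a space
printf 'x\n' > "$dir/c*d"              # name with an asterisk
printf 'x\n' > "$dir/-e"               # name with a leading dash

names=("$dir/a b" "$dir/c*d" "$dir/-e")

# Quoted expansions reach cmp as single, unmodified arguments.
cmp -s "${names[0]}" "${names[1]}" && echo "first two are identical"
cmp -s "${names[1]}" "${names[2]}" && echo "last two are identical"
```

The leading dash is harmless here only because each path starts with the directory. A bare name such as -e would be taken as an option; paths produced by find "$dir" carry the directory prefix, which avoids that.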
Quote:
For the elif statement, I want the code to simply do nothing. I know in C, you can just put a ;. I tried leaving it blank, but that doesn't seem to work. Then I tried echo "" to see if I could just get it to echo nothing, but unfortunately that creates a newline after every call to echo. I need to keep the case in, since if I just do else by itself, it will write out even in the cases I don't want (since I have three cases, 0, 1, or >1)
: can be used but a case statement is a more elegant solution:
Code:
case $COMP in
    0 )
        ln -f "$FIRST" "$SECOND"
        ;;
    1 )
        ;;
    * )
        echo "Error reading file"
esac
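For comparison, the : form might look like the sketch below. The function name and messages are illustrative; : is the shell's built-in null command, which does nothing and succeeds.

```shell
# Hypothetical wrapper around the three-way branch on cmp's exit status.
report() {
    local rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "duplicate"
    elif [ "$rc" -eq 1 ]; then
        :                              # files differ: deliberately do nothing
    else
        echo "Error reading file"
    fi
}

report 0    # prints "duplicate"
report 1    # prints nothing
report 2    # prints "Error reading file"
```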
AFAIK the output from find is already sorted so no need to pipe its output into sort.
No, find just returns names in the order in which it encounters them in the directory. Unless the file names were placed in the directory in their natural collating sequence, or you happen to be using some file system that keeps its directories sorted, the output will not be sorted. Here is a result from an ext3 file system:
Code:
$ find /usr/bin -maxdepth 1 -type f | head
/usr/bin/grmiregistry
/usr/bin/[
/usr/bin/ciptool
/usr/bin/gstack
/usr/bin/unprotoize
/usr/bin/pamtopam
/usr/bin/ldns-chaos
/usr/bin/upssched-cmd
/usr/bin/aconnect
/usr/bin/cupstestppd
$ find /usr/bin -maxdepth 1 -type f | sort | head
/usr/bin/.fipscheck.hmac
/usr/bin/.ssh.hmac
/usr/bin/411toppm
/usr/bin/FBReader
/usr/bin/GET
/usr/bin/HEAD
/usr/bin/POST
/usr/bin/RSA_SecurID_getpasswd
/usr/bin/Xdialog
/usr/bin/Xdialog-gtk1
Comparing every file against the others in a loop sounds inefficient. Consider creating a list of md5sums; sorting the list; and using uniq to locate duplicates.
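One way this could look, assuming GNU coreutils (md5sum, and uniq with -w and -D) and file names without newlines or backslashes; the function name list_duplicates is hypothetical. Each md5sum output line starts with a 32-character hash, so uniq is told to compare only those first 32 characters:

```shell
# Print "hash  path" lines for every regular file in $1 whose checksum
# occurs more than once. Sorting groups equal hashes together, and within
# a group the paths come out in sorted order.
list_duplicates() {
    find "$1" -maxdepth 1 -type f -exec md5sum {} + | sort | uniq -w 32 -D
}

d=$(mktemp -d)                         # scratch directory for the demo
printf 'same\n'  > "$d/a"
printf 'same\n'  > "$d/b"
printf 'other\n' > "$d/c"
list_duplicates "$d"
```

Files with equal checksums are almost certainly identical, but a final cmp on each candidate pair would confirm it before linking.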