Old 04-13-2012, 09:24 PM   #1
Skyrius
LQ Newbie
 
Registered: Apr 2012
Posts: 8

Rep: Reputation: Disabled
Bash Programming - creating hard links to duplicate files


So, first post then. I've never actually used Linux for programming anything, and I'm much more used to C, but we had a homework assignment that wanted us to get used to bash scripting. Our script will be passed a single input, the directory path that we are to check. We need to compare all the regular files in that directory and if any of them are duplicates, replace them with hard links, keeping the file that is lexicographically first (A before B, B before C, etc). Other than the fact we have to use cmp to compare two files and ln to create the link, we're pretty much given free rein. This is what I've sort of got right now:

Code:
!/bin/bash

## Creating a sorted array containing all the regular files in the directory we were passed
N=0
for i in $(find $1 -maxdepth 1 -type f | sort -u); do
    ARRAY[$N]="$i"
let "N= $N + 1"
done

##compare each item in the array with the rest, replacing duplicates with hard links
total=${#ARRAY[@]}
for((i=0; i<=$(( $total-1)); i++))
do
    FIRST=${ARRAY[$i]}
    for((j=$i+1; j<=$(($total-2)); j++))
    do
        SECOND=${ARRAY[$j]}
        COMP=(cmp -s $FIRST $SECOND)
        if [ $COMP -eq 0 ]; then
        ln -f $FIRST $SECOND
        elif [ $COMP -eq >1 ]; then
        echo "Error reading file" 
        fi
    done
done
I think I got the syntax wrong on a few things, but my main question is that, for the if statement there, am I allowed to pass the result of a command call? I know you can do things like compare variables, and the cmp command DOES return an exit status of 0 or 1, but am I actually allowed to do that?

Basically, I have a sorted array, so I don't have to worry about the lexicographical ordering.

Also, there's a warning that our script should be prepared to handle cases where the file names contain special characters such as spaces, "*", or leading "-" and such. However, just on the command line itself, I haven't actually run into an issue. Is there something about those that would possibly cause trouble if we run the script?

EDIT: It has occurred to me that this might have been better off in the Programming section, but I really am more or less a newbie at all of this. It confuses the heck out of me in most cases XD;;

Last edited by Skyrius; 04-13-2012 at 11:24 PM.
 
Old 04-13-2012, 11:08 PM   #2
towheedm
Member
 
Registered: Sep 2011
Location: Trinidad & Tobago
Distribution: Debian Stretch
Posts: 612

Rep: Reputation: 125Reputation: 125
To begin with, and as has been stated in several posts, please use code tags for your code. It preserves formatting and makes it much easier to read. That said, your very first mistake is:

Incorrect:
Code:
!/bin/bash
Correct:
Code:
#!/bin/bash
Quote:
am I allowed to pass the result of a command call?
You can pass the result of any command to a var using command substitution $(command):
Code:
x="$(echo "Hello") world"
echo $x
Hello world
The BASH variable "$?" always holds the exit status of the last command:
Code:
file1=/some/file
file2=/some/other/file
cmp "$file1" "$file2"
ret_val="$?"    # ret_val now holds the exit status of the cmp command
if [ "$ret_val" = "0" ]; then
  echo "Files are the same"
elif [ "$ret_val" = "1" ]; then
  echo "Files are different"
else
  echo "Error reading files"
fi
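For what it's worth, if simply tests the exit status of whatever command follows it, so cmp can also go straight into the condition without storing $? first. A rough sketch, assuming GNU cmp and throwaway files made with mktemp; compare_files is a made-up name used only for illustration:

```bash
#!/bin/bash
# compare_files is a hypothetical helper name, not something from the assignment.
compare_files() {
    if cmp -s -- "$1" "$2"; then
        echo "same"
    else
        # At this point $? still holds cmp's exit status:
        # 1 means the files differ, anything larger means trouble.
        case $? in
            1) echo "different" ;;
            *) echo "error" ;;
        esac
    fi
}

# Throwaway test files (assumed names, created only for this demo).
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a"
printf 'hello\n' > "$tmp/b"
printf 'world\n' > "$tmp/c"

r_same=$(compare_files "$tmp/a" "$tmp/b")
r_diff=$(compare_files "$tmp/a" "$tmp/c")
r_err=$(compare_files "$tmp/a" "$tmp/missing" 2>/dev/null)
echo "$r_same $r_diff $r_err"
rm -rf "$tmp"
```

The -s option keeps cmp quiet about differing files, so only the exit status matters here.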
Quote:
Also, there's a warning that our script should be prepared to handle cases where the file names contains special characters such as spaces, "*", or leading "-" and such
All BASH meta-characters appearing in file names must be escaped by preceding the meta-character with the escape character "\":

If the file's name is 'this file has spaces in the name':
Code:
cat this\ file\ has\ spaces\ in\ the\ name
There are many good tutorials on BASH scripting. Here's one to start you off:
http://www.gnu.org/software/bash/manual/bashref.html
 
Old 04-13-2012, 11:29 PM   #3
Skyrius
LQ Newbie
 
Registered: Apr 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Ah, sorry about the tags. I've put those in now (that is a lot neater XD); And thank you for the help. I think I've got it working. And since you've explained, I have a question about the escape clause too then. If I wanted to apply the cmp command to a file that had spaces or special characters in its name, and I've stored that file name in an array, I'm assuming the escape characters aren't already written in for me then? So when I grab that item from the array, I have to check for special characters and insert a \ before each character manually, correct?
 
Old 04-13-2012, 11:36 PM   #4
towheedm
Member
 
Registered: Sep 2011
Location: Trinidad & Tobago
Distribution: Debian Stretch
Posts: 612

Rep: Reputation: 125Reputation: 125
Quote:
I have to check for special characters and insert a \ before each character manually, correct
Yes, as far as I know that must be done.

Last edited by towheedm; 04-13-2012 at 11:41 PM.
 
Old 04-13-2012, 11:49 PM   #5
Skyrius
LQ Newbie
 
Registered: Apr 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Alright thanks, last question. This one's really silly and I FEEL silly for not being able to get it to do what I want.

For the elif statement, I want the code to simply do nothing. I know in C, you can just put a ;. I tried leaving it blank, but that doesn't seem to work. Then I tried echo "" to see if I could just get it to echo nothing, but unfortunately that creates a newline after every call to echo. I need to keep the case in, since if I just do else by itself, it will write out even in the cases I don't want (since I have three cases, 0, 1, or >1).
 
Old 04-14-2012, 12:00 AM   #6
yoK0
LQ Newbie
 
Registered: Apr 2012
Distribution: Slackware, CentOS
Posts: 29

Rep: Reputation: 0
Maybe wait 1 will do the work
 
Old 04-14-2012, 12:07 AM   #7
towheedm
Member
 
Registered: Sep 2011
Location: Trinidad & Tobago
Distribution: Debian Stretch
Posts: 612

Rep: Reputation: 125Reputation: 125
Well since I'm here:
The BASH built-in command:
Code:
:
does nothing beyond expanding its arguments and performing any redirections. It always returns true.

Code:
x="Hello"
echo $x
Hello
x="World"
echo $x
World
:  # By itself does nothing and returns an exit status of true
echo $x
World
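Applied to the three-way test from the question, the no-op can fill the branch that should stay silent. A minimal sketch; classify is a made-up function name and the numbers simply mirror cmp's 0 / 1 / >1 exit statuses:

```bash
#!/bin/bash
# classify is a hypothetical helper; its argument stands in for cmp's exit status.
classify() {
    local status=$1
    if [ "$status" -eq 0 ]; then
        echo "duplicate"
    elif [ "$status" -eq 1 ]; then
        :   # files differ: deliberately do nothing
    else
        echo "Error reading file"
    fi
}

out0=$(classify 0)
out1=$(classify 1)
out2=$(classify 2)
printf '[%s] [%s] [%s]\n' "$out0" "$out1" "$out2"
```

The middle pair of brackets comes out empty because the : branch prints nothing, not even a newline.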
 
Old 04-14-2012, 12:08 AM   #8
Skyrius
LQ Newbie
 
Registered: Apr 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Darn, no. It gives me an error saying wait: pid 1 is not a child of this shell. I have to use the bash shell for the assignment, but I don't think that's it. I'm not quite sure about this error.

EDIT: Ah, I figured it out. It's "then :". That's odd syntax. I'm just not used to script language I guess XD

Ninja'd. But thank you so much for your help. I've finally got it more or less working. The special characters in the file name I can get rid of with sed, but the problem is if I call sed on the file I stored in the array, it edits the stuff inside the file and not the file name. I guess I have to store the name as a string variable and call sed on that variable first. But I've already isolated all the cases where I'll need to edit things with an escape clause, so I just need to figure out how to apply the regex to the file name and not the file contents.

Last edited by Skyrius; 04-14-2012 at 12:13 AM.
 
Old 04-14-2012, 12:25 AM   #9
towheedm
Member
 
Registered: Sep 2011
Location: Trinidad & Tobago
Distribution: Debian Stretch
Posts: 612

Rep: Reputation: 125Reputation: 125
Quote:
But I've already isolated all the cases where I'll need to edit things with an escape clause, so I just need to figure out how to apply the regex to the file name and not the file contents.
Use a pipe (|). BASH allows you to use the output of one command as the input of the next command.

Code:
echo "Hello World" | sed -n 's/World/Earthlings/p'
Hello Earthlings
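The same idea works on a file name held in a variable: feed the name itself, not the file, into sed. A small sketch with a made-up file name; the second form uses bash's own ${var//pattern/replacement} expansion and needs no external command. (This only edits the string; it is an illustration of piping a value, not a claim that names need rewriting before use.)

```bash
#!/bin/bash
# Hypothetical file name used only for this demo.
name='my file name.txt'

# Pipe the *name* into sed; the file's contents are never touched.
via_sed=$(printf '%s\n' "$name" | sed 's/ /_/g')

# Pure-bash alternative: replace every space with an underscore.
via_expansion=${name// /_}

echo "$via_sed"
echo "$via_expansion"
```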
 
Old 04-14-2012, 02:26 AM   #10
cbtshare
Member
 
Registered: Jul 2009
Posts: 645

Rep: Reputation: 42
Your let also looks wrong; I've never seen it work that way.

Try:
let "N += 1"
 
Old 04-14-2012, 11:34 AM   #11
towheedm
Member
 
Registered: Sep 2011
Location: Trinidad & Tobago
Distribution: Debian Stretch
Posts: 612

Rep: Reputation: 125Reputation: 125
Code:
let "N= $N + 1"
This is actually one form of arithmetic operation accepted by BASH, although I would remove the spaces.

Code:
x=1
let "x=$x+1"
echo $x
2
I'm not sure how portable this is though.
 
Old 04-14-2012, 01:03 PM   #12
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208Reputation: 1208Reputation: 1208Reputation: 1208Reputation: 1208Reputation: 1208Reputation: 1208Reputation: 1208Reputation: 1208
A few observations ...

AFAIK the output from find is already sorted so no need to pipe its output into sort. Here's a test.
Code:
c@CW8:~$ find /usr/bin -maxdepth 1 -type f > /tmp/trash
c@CW8:~$ find /usr/bin -maxdepth 1 -type f | sort > /tmp/trash2
c@CW8:~$ diff /tmp/trash /tmp/trash2
[no output]
The for i in $(find $1 -maxdepth 1 -type f | sort -u); do breaks on file names including spaces (and tabs and newlines):
Code:
c@CW8:/tmp/tmp$ touch a 'a b'
c@CW8:/tmp/tmp$ for f in $( find -type f ); do echo ">$f<"; done
>./a<
>./a<
>b<
The robust solution is to use something like
Code:
while IFS='' read -r -d '' file
do
   files+=("$file")
done < <(find "$dir" -type f -print0)
Notes:
  1. The -print0 makes find output file names separated by ASCII NUL characters rather than newlines.
  2. The -d '' sets read's record separator to the empty string but bash is written in C which terminates strings with ASCII NUL so that's what read actually uses as the record separator (I think; anyway, it works).
  3. The -r puts read in raw mode so it does not interpret any "backslash escapes". For example any \t in the file name is kept as that, not translated to a tab.
  4. The IFS='' sets bash's field separator to the empty string. By default it is space tab newline so would strip any of those characters from the front and back of the file name. EDIT: more properly the "path name".
  5. files+=("$file") adds the file name to the array files. No need to use and increment an array index. The number of elements in the array is available as ${#array[*]}, leading to the standard idiom for iterating over an array:
    Code:
    for (( i=0; i<${#array[*]}; i++ ))
    do
       # Do something with ${array[i]}
    done
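Putting the notes above together, here is a sketch that fills an array from find and keeps it in a predictable order. It assumes GNU sort, whose -z option sorts NUL-separated records; the directory and file names are invented for the demo:

```bash
#!/bin/bash
# Demo directory with awkward names (all names are made up for illustration).
dir=$(mktemp -d)
touch -- "$dir/b" "$dir/a b" "$dir/-dash" "$dir/star*"

files=()
while IFS='' read -r -d '' file
do
    files+=("$file")
done < <(find "$dir" -maxdepth 1 -type f -print0 | LC_ALL=C sort -z)

printf '%s entries\n' "${#files[@]}"
for (( i=0; i<${#files[@]}; i++ ))
do
    # ${files[i]##*/} strips the directory part for display only.
    printf '>%s<\n' "${files[i]##*/}"
done
rm -rf "$dir"
```

LC_ALL=C forces plain byte order, so the result does not depend on the locale's collation rules.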

cmp is "expensive". If you have a lot of files, it is better to test progressively for identity, starting with the cheapest test. stat --printf %s "$filename" would be a good choice for the first test. md5sum "$filename" might be a better next/last test than cmp "$filename1" "$filename2". Both stat and md5sum have to be run on both files, so they take a bit more coding than cmp; you may prefer to stay with cmp if you don't have many files. If you do decide on the others, the test could be something like if [[ $(stat --printf %s "$filename1") -eq $(stat --printf %s "$filename2") ]]; then
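A sketch of that cheapest-test-first idea: compare sizes with stat and only fall through to cmp when they match. It assumes GNU stat (for --printf); same_content is a made-up function name and its return codes are chosen to mirror cmp's:

```bash
#!/bin/bash
# same_content is a hypothetical helper.
# Returns 0 if identical, 1 if different, 2 if a file could not be examined.
same_content() {
    local size1 size2
    size1=$(stat --printf %s -- "$1") || return 2
    size2=$(stat --printf %s -- "$2") || return 2
    # Files of different sizes cannot be identical, so skip reading them.
    [[ $size1 -eq $size2 ]] || return 1
    cmp -s -- "$1" "$2"
}

# Throwaway files for the demo (assumed names).
tmp=$(mktemp -d)
printf 'abc\n'    > "$tmp/one"
printf 'abc\n'    > "$tmp/two"
printf 'abcdef\n' > "$tmp/three"

st_same=0; same_content "$tmp/one" "$tmp/two"   || st_same=$?
st_diff=0; same_content "$tmp/one" "$tmp/three" || st_diff=$?
echo "$st_same $st_diff"
rm -rf "$tmp"
```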

In ARRAY[$N]="$i", the double quotes and the $ in front of N have no effect. Bash does not do "word splitting" on the expression to the right of an assignment =. Reference here. The expression inside an array index [ ] is an arithmetic expression. Bash substitutes the value of any variables named within an arithmetic expression without the need for the $ ("the value of") operator.

Similarly the string to the right of a let statement is an arithmetic expression. let is the original statement used to evaluate arithmetic expressions and is intuitively obvious. The later equivalent is (( <arithmetic expression> )) which is less obvious but has the advantage over let that it can be used as a test:
Code:
if (( <arithmetic expression> )); then
and can be substituted by its value using the $ operator: $(( <arithmetic expression> )).

Not only is the $ operator unnecessary in arithmetic expressions, whitespace either side of operators is optional so i=a+b can also be i = a + b according to taste.

EDIT: that is not true for arithmetic expressions used with a let. They must either have no space around the operators (let i=a+b) or be in a string (let 'i = a + b').
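For comparison, here are the forms discussed above side by side, each adding 1 to the same counter. The variable names are arbitrary:

```bash
#!/bin/bash
n=0
let "n = n + 1"      # let with a quoted expression: spaces are fine
let n=n+1            # let without quotes: no spaces allowed
(( n += 1 ))         # arithmetic command
n=$(( n + 1 ))       # arithmetic expansion, substituted by its value

# (( )) can also be used directly as a test.
if (( n == 4 )); then
    verdict="four increments"
else
    verdict="unexpected"
fi
echo "$n $verdict"
```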

This code ...
Code:
cmp "$file1" "$file2"
ret_val="$?"    # ret_val now holds the exit status of the cmp command
if [ "$ret_val" = "0" ]; then
  echo "Files are the same"
elif [ "$ret_val" = "1" ]; then
  echo "Files are different"
else
  echo "Error reading files"
fi
... can be simplified to avoid the need for ret_val by
Code:
cmp "$file1" "$file2"
case $? in
    0 )
        echo "Files are the same"
        ;;
    1 ) 
        echo "Files are different"
        ;;
    * )
        echo "Error reading files"
esac
Quote:
If I wanted to apply the cmp command to a file that had spaces or special characters in them, if I've stored that file name into an array, I'm assuming the escape characters aren't already written in for me then? So when I grab that item from the array, I have to check for special characters and insert a \ before each character manually, correct?
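No. As long as you use double quotes around the variables, their values will be passed verbatim to cmp: cmp -s "$FIRST" "$SECOND". Note that COMP=(cmp -s ...) creates an array rather than running cmp; run the command and then capture its exit status with COMP=$?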
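A quick way to convince yourself, sketched with invented file names containing a space, a "*" and a leading "-". The names are stored in an array unescaped and only ever used inside double quotes; the -- is an extra precaution that tells cmp that everything after it is a file name, which matters when a name itself starts with "-":

```bash
#!/bin/bash
# All file names below are made up for this demo.
tmp=$(mktemp -d)
names=( "$tmp/a b" "$tmp/star*" "$tmp/-n" )

for f in "${names[@]}"; do
    printf 'same text\n' > "$f"
done

matches=0
# Compare the first name against each of the others.
for f in "${names[@]:1}"; do
    if cmp -s -- "${names[0]}" "$f"; then
        (( matches += 1 ))
    fi
done
echo "$matches"
rm -rf "$tmp"
```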
Quote:
For the elif statement, I want the code to simply do nothing. I know in C, you can just put a ;. I tried leaving it blank, but that doesn't seem to work. Then I tried echo "" to see if I could just get it to echo nothing, but unfortunately that creates a newline after every call to echo. I need to keep the case in, since if I just do else by itself, it will write out even in the cases I don't want (since I have three cases, 0, 1, or >1)
: can be used but a case statement is a more elegant solution:
Code:
case $COMP in
    0 )
        ln -f "$FIRST" "$SECOND"
        ;;
    1 )
        ;;
    * )
        echo "Error reading file"
esac
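For reference, the pieces discussed in this thread can be assembled roughly as follows. This is only a sketch under assumptions, not a model answer: link_duplicates is a made-up name, GNU find/sort/stat are assumed, and byte order (LC_ALL=C) is taken as the meaning of "lexicographically first":

```bash
#!/bin/bash
# link_duplicates is a hypothetical function sketching the task from post #1.
link_duplicates() {
    local dir=$1 file i j status
    local -a files=()

    # NUL-separated names survive spaces and newlines; sorting means the
    # lexicographically first copy is always the link source.
    while IFS='' read -r -d '' file; do
        files+=("$file")
    done < <(find "$dir" -maxdepth 1 -type f -print0 | LC_ALL=C sort -z)

    for (( i=0; i<${#files[@]}; i++ )); do
        for (( j=i+1; j<${#files[@]}; j++ )); do
            # Skip pairs that already share an inode.
            [[ ${files[i]} -ef ${files[j]} ]] && continue
            status=0
            cmp -s -- "${files[i]}" "${files[j]}" || status=$?
            case $status in
                0 ) ln -f -- "${files[i]}" "${files[j]}" ;;
                1 ) ;;
                * ) echo "Error reading file" >&2 ;;
            esac
        done
    done
}

# Demo on a throwaway directory (file names are invented).
demo=$(mktemp -d)
printf 'x\n' > "$demo/a"
printf 'x\n' > "$demo/b c"
printf 'y\n' > "$demo/d"
link_duplicates "$demo"
links_a=$(stat --printf %h "$demo/a")   # hard-link count of the kept file
links_d=$(stat --printf %h "$demo/d")   # unique file keeps a count of 1
echo "$links_a $links_d"
rm -rf "$demo"
```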

Last edited by catkin; 04-15-2012 at 05:01 AM.
 
Old 04-15-2012, 11:41 AM   #13
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,779

Rep: Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212Reputation: 2212
Quote:
Originally Posted by catkin View Post
AFAIK the output from find is already sorted so no need to pipe its output into sort.
No, find just returns names in the order in which it encounters them in the directory. Unless the file names were placed in the directory in their natural collating sequence, or you happen to be using some file system that keeps its directories sorted, the output will not be sorted. Here is a result from an ext3 file system:
Code:
$ find /usr/bin -maxdepth 1 -type f | head
/usr/bin/grmiregistry
/usr/bin/[
/usr/bin/ciptool
/usr/bin/gstack
/usr/bin/unprotoize
/usr/bin/pamtopam
/usr/bin/ldns-chaos
/usr/bin/upssched-cmd
/usr/bin/aconnect
/usr/bin/cupstestppd
$ find /usr/bin -maxdepth 1 -type f | sort | head
/usr/bin/.fipscheck.hmac
/usr/bin/.ssh.hmac
/usr/bin/411toppm
/usr/bin/FBReader
/usr/bin/GET
/usr/bin/HEAD
/usr/bin/POST
/usr/bin/RSA_SecurID_getpasswd
/usr/bin/Xdialog
/usr/bin/Xdialog-gtk1
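If an external sort is not wanted, one alternative is bash's own pathname expansion, which returns its matches in sorted order. A minimal sketch with invented file names; note that unlike find, a plain * skips names beginning with "." unless the dotglob option is set, and the -L test is there because -f alone also accepts symlinks to regular files:

```bash
#!/bin/bash
# Files are created out of order on purpose (names are made up).
dir=$(mktemp -d)
touch "$dir/c" "$dir/a" "$dir/b"

sorted=()
for f in "$dir"/*; do
    # Keep regular files only, and not symlinks pointing at them.
    if [[ -f $f && ! -L $f ]]; then
        sorted+=("${f##*/}")
    fi
done
echo "${sorted[*]}"
rm -rf "$dir"
```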
 
Old 04-15-2012, 07:08 PM   #14
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751
Just a minor point, but numeric comparisons use operators '-eq', '-gt', '-ge' etc
Symbolic operators eg '=', '>' are for string comparisons
http://tldp.org/LDP/abs/html/comparison-ops.html
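A small illustration of why the distinction matters, using 10 and 9: compared as numbers 10 is larger, but compared as strings "10" sorts before "9" because the comparison goes character by character. The variable names are arbitrary:

```bash
#!/bin/bash
a=10
b=9

# Numeric comparison: 10 is greater than 9.
if [ "$a" -gt "$b" ]; then numeric="10 is greater"; else numeric="9 is greater"; fi

# String comparison inside [[ ]]: "1" sorts before "9", so "10" < "9".
if [[ "$a" > "$b" ]]; then string="10 sorts after"; else string="9 sorts after"; fi

echo "$numeric"
echo "$string"
```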

In general, you'll find these links useful
http://rute.2038bug.com/index.html.gz
http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/

I'd also recommend [[ ]] over [ ] http://tldp.org/LDP/abs/html/testcon...ml#DBLBRACKETS
 
Old 04-15-2012, 07:16 PM   #15
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682Reputation: 682Reputation: 682Reputation: 682Reputation: 682Reputation: 682
Comparing every file against the others in a loop sounds inefficient. Consider creating a list of md5sums; sorting the list; and using uniq to locate duplicates.
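A rough sketch of that approach, assuming GNU coreutils: md5sum prints a 32-character digest followed by the name, sort brings equal digests together, and uniq -w32 -D keeps only the lines whose first 32 characters repeat. The file names are invented, and md5sum escapes names containing backslashes or newlines, so this simple pipeline is only suitable for tamer names:

```bash
#!/bin/bash
# Throwaway directory with two identical files and one different one.
dir=$(mktemp -d)
printf 'same\n'  > "$dir/a"
printf 'same\n'  > "$dir/b"
printf 'other\n' > "$dir/c"

# -w32 compares only the digest column; -D prints every member of each group.
dupes=$(cd "$dir" && md5sum -- * | sort | uniq -w32 -D)
dupe_count=$(printf '%s\n' "$dupes" | grep -c '')

echo "$dupes"
echo "$dupe_count candidate duplicates"
rm -rf "$dir"
```

Since the assignment requires cmp, the digest match would only narrow down the candidates; each candidate pair would still be confirmed with cmp before linking.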
 
  

