[SOLVED] Compare two files and output differences to new file
I am not very familiar with bash scripting, and I put this script together by looking at different websites and forums. But I still can't get it to work.
What I am trying to do is compare the list of domains in each category of an old blacklist/whitelist with the list of domains in a new blacklist/whitelist. The aim is to delete the old blacklist after adding to the new blacklist any domains that are in the old one but missing from the new one.
I have two files - comparedomains.sh and domain_mapping.txt
In domain_mapping.txt I have a path which points the script to the location of the old blacklist file called domains and the location of the new blacklist file also called domains. The idea is that I should be able to include every category as listed below in that file. I have only added cleaning category for testing purposes:
Code:
old/cleaning/domains;new/cleaning/domains
In the comparedomains.sh script I have the following:
Code:
#!/bin/bash
# Create a file which has the mappings of which two files to compare
# e.g. in domain_mapping.txt
# old/cleaning/domains;new/cleaning/domains
mapping_file="/home/domain_mapping.txt"
# missing_list will be the domains that are missing from new domain but are in old domain file.
# This file will be created in each directory, e.g. new/cleaning/missing_list
while read -r mapping;
do
echo "Is this even mapping"
new_file="${mapping#*,}"
missing_list=$(echo $new_file | sed 's/domains/missing/g')
old_file="${mapping%,*}"
echo "Comparing ${old_file} and ${new_file}"
# Initialise files
rm $missing_list 1>/dev/null 2>&1
while read -r website;
echo "Is this even website"
do
if [[ ! $(grep $website $new_file) ]]; then
echo $website >> $missing_list
fi
done < $old_file
echo ""
echo "Completed ${old_file}"
echo ""
done < $mapping_file
echo "Completed all mapping file comparisons"
exit 0
The script should read the domain_mapping.txt file, get the old domain file as being old/cleaning/domains, get the new domain file as new/cleaning/domains, compare the differences between those two files and then create a missing_list in the new/cleaning folder for any domains that are in the old domain file but not in the new domain file. The order of the domain listings is irrelevant. I just want to know that domains abc are not in the new domain file. I don't care which line the domains are listed on in either file.
However, the script never seems to run anything between "while read" and "done". I added echo "Is this even mapping" and echo "Is this even website", and these never show in the command output. All I ever see output is "Completed all mapping file comparisons". Everything between "while" and "done" seems to be ignored.
I think it may have something to do with this line - while read -r mapping; My understanding is that the name can be anything so I called it mapping. But maybe I am not understanding how the while read -r command works. But maybe the problem is something else? I can't see what the problem is, but it would appear from my test echos (mapping) and (website) that the while read section is not even running.
Don't be too harsh, we all gotta learn what we need to ask.
It was not intended as such, merely to reinforce what Pan was saying - though it was a bit of a hasty response, and on reflection I don't agree diff is the optimal tool.
The first of those essentially gives the content for the new new file, whilst the second is useful if there's a reason to segment the old file data in some fashion.
Sorry for the late reply. Was busy with something else yesterday, which took longer than expected.
I have managed to get it working, yay, so I won't worry about starting all over again using diff. However, I have absolutely no idea why I was experiencing this issue.
I copy the comparedomains.sh file and domain_mapping.txt file from my windows computer to my Linux virtual machine (Slackware) along with the domain files that I wish to compare.
On the Linux machine, I then chmod 755 comparedomains.sh to make the script executable. When I run the script, (using the set -x option suggested by MichaelK), I get the following output:
mapping_file="domain_mapping.txt"
read -r mapping
echo "Completed all mapping file comparisons"
Now if I edit the domain_mapping.txt (in Linux - say remove a letter and add the letter back again) and save the file and then run the script - everything works! For some reason, when you copy that text file from Windows to Linux, the script obviously can't open/read the file. Editing that txt file in Linux and saving it makes the file open/readable. I discovered this by accident as I had domain instead of domains as the name of one of the blacklist files. However that typo was not the cause of the script failing. I tested this by putting in correct and incorrect paths. (Once you have edited that file in Linux), the script reports an error file or directory not found if the path is wrong.
I tested whether it makes any difference if the file in question has a .txt extension. Makes no difference.
I will give best answer to MichaelK as that put me on the right track with the suggestion of using set -x. For the record, I did pick up the error regarding "," versus ";" but that made no difference as the file wasn't being read obviously.
My script now works perfectly, but if anyone can explain why I need to open the domain_mapping file on Linux, make a change and save it before the comparedomains.sh script can open that file that would be fantastic. I have no problems with scripts reading the domains files (which were also copied from Windows and were not edited on Linux), so I cannot see why this particular file is so special?
This looks like an end-of-line character problem. Windows text files and Linux text files use different end-of-line characters: Linux uses just LF, whereas Windows uses CR LF.
There are many ways to convert from Windows/DOS to linux end of line characters. One example from the command line using tr is:
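A minimal sketch of that kind of conversion. The sample file and both file names are placeholders made up for illustration, not the poster's real paths. Note that `tr -d '\r'` deletes every carriage return in the input, so it is only appropriate when CR appears solely as part of a line ending.

```shell
#!/bin/bash
# Create a small sample file with Windows (CR LF) line endings.
# The file names here are placeholders for illustration only.
workdir=$(mktemp -d)
printf 'old/cleaning/domains;new/cleaning/domains\r\n' > "$workdir/domain_mapping.txt"

# Delete every carriage return and write the result to a new file.
# tr reads stdin and writes stdout, so it cannot edit a file in place.
tr -d '\r' < "$workdir/domain_mapping.txt" > "$workdir/domain_mapping.unix.txt"

# Count lines that still contain a carriage return (0 means converted).
remaining=$(grep -c $'\r' "$workdir/domain_mapping.unix.txt")
echo "lines still containing CR: $remaining"
```

Writing to a second file and then renaming it over the original avoids truncating the input while it is still being read.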
Would that explain why the domains files work? Presumably those files would have been created on a Linux machine at some point even though I have edited them on Windows (or at least edited a couple of them). The domain files were originally downloaded from the Internet on my windows computer.
I presume that this problem only occurs for text files. Pretty much every .sh file that I have run on Linux was probably created on Windows and then copied across and chmod 755 to make it run.
Thank you, I will have a look into this end of line issue as I have never encountered it before. Surprising that Slackware doesn't report that it can't open the file or something.
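On the point about no error being reported: the shell can open a file like this without any problem, so there is nothing for it to complain about. One related case worth ruling out (an assumption, not something confirmed in this thread) is a last line with no newline after it, which some Windows editors produce. In that case `read` still fills the variable but returns a non-zero status, so a plain `while read` loop skips that line. A small sketch with a made-up sample file:

```shell
#!/bin/bash
# Hypothetical sample: one line with NO newline at the end of the file.
mapping_file=$(mktemp)
printf 'old/cleaning/domains;new/cleaning/domains' > "$mapping_file"

# Plain loop: read returns non-zero at end of file, so the body is skipped
# even though the file was opened and the text was read into the variable.
plain_count=0
while read -r mapping; do
    plain_count=$((plain_count + 1))
done < "$mapping_file"

# Guarded loop: also run the body when read failed but left text behind.
guarded_count=0
while IFS= read -r mapping || [[ -n $mapping ]]; do
    guarded_count=$((guarded_count + 1))
done < "$mapping_file"

echo "plain loop ran $plain_count time(s), guarded loop ran $guarded_count time(s)"
```

Saving the file from a Linux editor normally adds the final newline, which would also be consistent with the file starting to work after an edit.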
I have managed to get it working Yay, so I won't worry about starting all over again using diff.
...
My script now works perfectly
Perfectly is a bold claim. Does it really behave as desired for all possible input variations? Or did you do what a lot of programmers do and test only a single use case?
(For example, did you fix the bugs caused by unquoted filename variables present in the original script?)
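To illustrate the kind of problem meant here, below is a sketch of the inner comparison with every variable quoted. It is not the original script: `list_missing` is a hypothetical helper name, and `grep -Fxq` is swapped in for the original `[[ ! $(grep ...) ]]` test on the assumption that each domain should be matched as a fixed string against a whole line.

```shell
#!/bin/bash
# list_missing is a hypothetical helper name, not part of the original script.
# Usage: list_missing OLD_FILE NEW_FILE OUTPUT_FILE
list_missing() {
    local old_file=$1 new_file=$2 missing_list=$3
    local website
    : > "$missing_list"                      # start with an empty output file
    while IFS= read -r website; do
        [[ -z $website ]] && continue        # skip blank lines
        # -F fixed string, -x whole line, -q quiet, -- ends option parsing
        if ! grep -Fxq -- "$website" "$new_file"; then
            printf '%s\n' "$website" >> "$missing_list"
        fi
    done < "$old_file"
}

# Small demonstration; the directory name deliberately contains a space.
demo_dir=$(mktemp -d)/"black list"
mkdir -p "$demo_dir"
printf '%s\n' example.com old-only.example a.example > "$demo_dir/old"
printf '%s\n' a.example example.com.au > "$demo_dir/new"
list_missing "$demo_dir/old" "$demo_dir/new" "$demo_dir/missing"
cat "$demo_dir/missing"
```

With an unquoted `grep $website $new_file`, the path containing a space would be split into two arguments, and a plain pattern such as example.com would also match example.com.au, so that domain would wrongly be treated as already present.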
while read -r mapping;
do
echo "Is this even mapping"
new_file="${mapping#*,}"
...
while read -r website;
echo "Is this even website"
do
if [[ ! $(grep $website $new_file) ]]; then
echo $website >> $missing_list
fi
done < $old_file
...
done < $mapping_file
If mapping_file contains \r, then new_file will also contain it (as the last character), so your grep will look for a filename ending in that \r. Obviously no such file exists, so grep will fail.
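A rough sketch of how that trailing \r ends up in the variable, and one way to strip it inside the loop instead of converting the file first. The sample line mirrors the mapping shown earlier in the thread; `raw_new_file` is a name introduced only for this illustration.

```shell
#!/bin/bash
# Simulate a mapping file saved with CR LF line endings.
mapping_file=$(mktemp)
printf 'old/cleaning/domains;new/cleaning/domains\r\n' > "$mapping_file"

while IFS= read -r mapping; do
    raw_new_file=${mapping#*;}        # still ends with a carriage return
    mapping=${mapping%$'\r'}          # drop one trailing CR, if present
    old_file=${mapping%;*}            # text before the last ';'
    new_file=${mapping#*;}            # text after the first ';'
    echo "raw length: ${#raw_new_file}, cleaned length: ${#new_file}"
done < "$mapping_file"
```

The raw value is one character longer than the cleaned one, and that invisible extra character is what makes the path point at a file that does not exist.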
MichaelK - That would make sense that the issue with the end of line character is more to do with the read command. I have never experienced this problem before, but I don't think that I ever used the read command before that I can recall.
BoughtonP - Well, perhaps perfect was an exaggeration. The point I was making was that I was very pleased to finally get the script doing what I was expecting it to do. I have only tested a small sample (which did work - dare I say it - perfectly). But I was waiting for more info on this end of line character issue before doing more testing on other domain files. I can't see what bug you mean. I have fixed the "," and made it ";" if that's what you mean.
Pan64 - I will make sure that the mapping_file is converted to the Linux end-of-line format, so it won't contain \r. I will do more testing on this because I have not experienced end of line issues before, so this is new to me.