How to remove duplicate words within a particular text in a file?
I am basically trying to remove duplicate words in my <title></title> tags after I got hit by Google Panda. I have around 750 .html files, and it would be difficult for me to fix them one by one. I am looking for a way to remove the duplicates only from within <title></title>.
Example of a duplicate title I have:
Code:
<title>Pasta, Pasta Recipe and Pasta Guide</title>
|
Assuming you want to keep the tags and remove the text inside them:
Code:
sed -i 's@<title>Pasta, Pasta Recipe and Pasta Guide</title>@<title></title>@' *.html
|
Thanks for your reply. I basically just want Pasta to appear once in the <title>, and I want to remove the other two Pasta references from my title. My files have a similar pattern for other dishes too.
|
So then the question is ... which Pasta to keep? I.e., show us what the output should be after running the command on the example you have provided.
|
Code:
<title>Pasta, Recipe and Guide</title>
|
Is "pasta" the only word? Or are there other ones?
Do you want to make it so that there's only one of each word (extremely complicated to do), or only one of the word "pasta" (very simple to do)? For example, if you had an HTML file with this title:
Code:
dsfas fgryr sdfgamrtf test gasdfig test fsgksdfg
Code:
dsfas fgryr sdfgamrtf test gasdfig fsgksdfg
|
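The "only one of each word" case can be sketched in awk without too much machinery, assuming each title sits on a single line and that words are compared case-insensitively with punctuation ignored. The function name dedupe_title_words is hypothetical, not something from this thread, and this is only a rough sketch: it keeps the first occurrence of each word and drops later ones, so punctuation attached to a dropped word is lost with it.

```shell
# Hypothetical helper: drop repeated words from the text of a one-line <title>.
# Assumes the opening and closing tags are on the same line.
dedupe_title_words() {
  awk '
    {
      line  = $0
      low   = tolower(line)
      start = index(low, "<title>")
      stop  = index(low, "</title>")
      # Lines without a complete title are passed through untouched.
      if (start == 0 || stop == 0 || stop < start) { print; next }

      head = substr(line, 1, start + 6)                 # up to and including <title>
      text = substr(line, start + 7, stop - start - 7)  # text between the tags
      tail = substr(line, stop)                         # </title> and the rest

      n = split(text, words, " ")
      split("", seen)                                   # reset the lookup table
      out = ""
      for (i = 1; i <= n; i++) {
        key = tolower(words[i])
        gsub(/[^a-z0-9]/, "", key)                      # compare without punctuation
        if (key != "" && (key in seen)) continue        # repeated word: skip it
        seen[key] = 1
        out = out (out == "" ? "" : " ") words[i]
      }
      print head out tail
    }'
}
```

For example, piping the title from the first post through it:

```shell
printf '%s\n' '<title>Pasta, Pasta Recipe and Pasta Guide</title>' | dedupe_title_words
# prints: <title>Pasta, Recipe and Guide</title>
```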
Actually, I have 700 files for various dishes, so Pasta was just an example. The rest of the words remain the same. I just want to remove the dish names that are repeated.
Is it possible to remove words from the "," up to </title>? |
Try the following:
Code:
sed 's/<title>\([A-Z][a-z]*\), \1 \([A-Z][a-z]*\) and \1 \([A-Z][a-z]*\)<\/title>/<title>\1, \2 and \3<\/title>/g' file
Code:
sycamorex@mordor:~/temp$ cat file
|
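To see what that expression does and where it stops working, here is a small sketch that feeds it two sample titles on standard input instead of a file. The wrapper name collapse_single_word is made up for illustration. The group \([A-Z][a-z]*\) matches exactly one capitalized word, and \1 then requires that same word to appear again, so a two-word dish name (or an uppercase <TITLE> tag) is not expected to match.

```shell
# The single-word expression from the post above, wrapped in a hypothetical helper
# that reads from stdin so it can be tried without touching any file.
collapse_single_word() {
  sed 's/<title>\([A-Z][a-z]*\), \1 \([A-Z][a-z]*\) and \1 \([A-Z][a-z]*\)<\/title>/<title>\1, \2 and \3<\/title>/g'
}

# One-word dish name: the repeated word is collapsed.
printf '%s\n' '<title>Pasta, Pasta Recipe and Pasta Guide</title>' | collapse_single_word
# prints: <title>Pasta, Recipe and Guide</title>

# Two-word dish name: the pattern does not match, so the line is left unchanged.
printf '%s\n' '<title>Jaggery Dosa, Jaggery Dosa Recipe and Jaggery Dosa Guide</title>' | collapse_single_word
```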
Perhaps, it'll be easier if you post all the lines containing the <title>. I know it'll be over 700 lines but it'll give us a better picture:
Code:
grep -h "<title>" *.html
|
I will post a few. Should I post all of it?
Code:
<TITLE>Jaggery Dosa, Jaggery Dosa Recipe and Jaggery Dosa Guide</TITLE>
|
I am not sure that a quick and dirty one liner will do it. As you are parsing html it would probably be best to use something like Perl.
However, if the pattern is always the same, the first word is always the one to be removed, and we don't have to worry about case sensitivity (i.e. it will always be the same), something like this might work:
Code:
awk 'BEGIN{FS = "[, ]"}$1 ~ /title/{split($1,a,">");for(i=2;i<=NF;i++)if($i == a[2])$i = "";gsub(/ +/," ");$1=$1","}1' file
I think sycamorex's suggestion above is still good though. |
Code:
sed -ri 's@<TITLE>([^,]+), \1 Recipe and \1 Guide</TITLE>@<TITLE>\1, Recipe and Guide</TITLE>@g' *.html
|
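Since that command edits every file in place, it may be worth trying the substitution on a scratch copy first. The sketch below is one cautious way to do that, not the command from the post: it uses sed -E rather than -r, accepts either <title> or <TITLE> by capturing the tag name, and writes the result from a .bak copy instead of using -i. The names fix_title, workdir and dosa.html are all made up for the example.

```shell
# Hypothetical helper: collapse "X, X Recipe and X Guide" titles read from stdin.
# \1 and \3 hold the tag name as written; \2 holds the dish name (anything up to the comma).
fix_title() {
  sed -E 's@<(title|TITLE)>([^,]+), \2 Recipe and \2 Guide</(title|TITLE)>@<\1>\2, Recipe and Guide</\3>@'
}

# Dry run in a scratch directory with one sample file.
workdir=$(mktemp -d)
printf '%s\n' '<TITLE>Jaggery Dosa, Jaggery Dosa Recipe and Jaggery Dosa Guide</TITLE>' > "$workdir/dosa.html"

for f in "$workdir"/*.html; do
  cp "$f" "$f.bak"             # keep an untouched copy next to each file
  fix_title < "$f.bak" > "$f"  # rewrite the file from its backup
done

cat "$workdir/dosa.html"
# prints: <TITLE>Jaggery Dosa, Recipe and Guide</TITLE>
```

Comparing each file with its .bak copy (for example with diff) shows exactly which titles changed before anything is run against the real directory. The scratch directory in $workdir can be deleted afterwards.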
MTK358, grail and sycamorex, I was able to fix it. I appreciate all your help.
|