LinuxQuestions.org

guessity 07-22-2011 09:15 AM

How to remove duplicate words within a particular text in a file?
 
I am basically trying to remove duplicate words in my <title></title> tag after I got hit by Google Panda. I have around 750 .html files and it will be difficult for me to remove them one by one. I am looking for a way to remove them only from within <title> </title>.

Example of a duplicate title I have:
Code:

<title>Pasta, Pasta Recipe and Pasta Guide</title>
Can someone suggest a way to do it? I don't want to replace those words anywhere else in the file, only within the <title>.

MTK358 07-22-2011 10:13 AM

Assuming you want to keep the tags and remove the text inside them:

Code:

sed -i 's@<title>Pasta, Pasta Recipe and Pasta Guide</title>@<title></title>@' *.html
Note that it's probably a good idea to back up the HTML files, because you could easily accidentally ruin them with commands like this.
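One way to get that backup for free is to give sed's -i option a suffix, so each original is kept next to the edited file. The following is only a sketch: the function name empty_titles_with_backup and the .bak suffix are made up for illustration, and the expression is the same one as above.

```shell
# Hypothetical helper: run the substitution on every .html file in a directory,
# keeping each original as <name>.html.bak (the suffix is an arbitrary choice).
empty_titles_with_backup() {
  dir=$1
  for f in "$dir"/*.html; do
    [ -e "$f" ] || continue   # skip the literal pattern if no .html file exists
    sed -i.bak 's@<title>Pasta, Pasta Recipe and Pasta Guide</title>@<title></title>@' "$f"
  done
}
```

If something goes wrong, the .bak copies can be moved back over the edited files.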

guessity 07-22-2011 10:24 AM

Thanks for your reply. I basically just want Pasta to appear once in the <title> and want to remove the other two Pasta references from my title. My files follow a similar pattern for other dishes too.

grail 07-22-2011 11:22 AM

So then the question is ... which Pasta to keep? I.e. show us what the output would be after running the command on the example you have provided.

guessity 07-22-2011 12:53 PM

Code:

<title>Pasta, Recipe and Guide</title>
I am looking for this output.

MTK358 07-22-2011 01:04 PM

Is "pasta" the only word? Or are there other ones?

Do you want there to be only one of each word (extremely complicated to do), or only one of the word "pasta" (very simple to do)?

For example, if you had an HTML file with this title:

Code:

dsfas fgryr sdfgamrtf test gasdfig test fsgksdfg
Do you want it unchanged, or do you want this output:

Code:

dsfas fgryr sdfgamrtf test gasdfig fsgksdfg

guessity 07-22-2011 01:14 PM

Actually I have 700 files of various dishes, so Pasta was just an example. The rest of the words remain the same. I just want to remove the dish names that are repeated.

Is it possible to remove words from the "," up to the </title>?

sycamorex 07-22-2011 01:19 PM

Try the following:
Code:

sed 's/<title>\([A-Z][a-z]*\), \1 \([A-Z][a-z]*\) and \1 \([A-Z][a-z]*\)<\/title>/<title>\1, \2 and \3<\/title>/g' file

Code:

sycamorex@mordor:~/temp$ cat file
<title>Pasta, Pasta Recipe and Pasta Guide</title>
<title>Soup, Soup Recipe and Soup Guide</title>
<title>Beans, Beans Recipe and Beans Guide</title>
<title>Cakes, Cakes Recipe and Cakes Guide</title>
sycamorex@mordor:~/temp$ sed 's/<title>\([A-Z][a-z]*\), \1 \([A-Z][a-z]*\) and \1 \([A-Z][a-z]*\)<\/title>/<title>\1, \2 and \3<\/title>/g' file
<title>Pasta, Recipe and Guide</title>
<title>Soup, Recipe and Guide</title>
<title>Beans, Recipe and Guide</title>
<title>Cakes, Recipe and Guide</title>


MTK358 07-22-2011 01:23 PM

Quote:

Originally Posted by guessity (Post 4422582)
Actually I have 700 files of various dishes, so Pasta was just an example. The rest of the words remain the same. I just want to remove the dish names that are repeated.

Is it possible to remove words from the "," up to the </title>?

So for every HTML file, check if the title contains more than one instance of a certain word, and if so, remove all but the first instance of that word. Is that correct?

guessity 07-22-2011 01:30 PM

Quote:

Originally Posted by MTK358 (Post 4422591)
So for every HTML file, check if the title contains more than one instance of a certain word, and if so, remove all but the first instance of that word. Is that correct?

Yes.
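A rough sketch of that rule in awk, for anyone who wants the general "keep only the first instance of each word" behaviour. It rests on several assumptions: the whole <title>...</title> element sits on one line, words are compared case-sensitively after stripping everything except ASCII letters and digits, and the function name dedupe_title_words is invented for this example. It will also drop a legitimately repeated word such as a second "and".

```shell
# Hypothetical filter: reads HTML on stdin, writes it to stdout with repeated
# words removed from the text between <title> and </title> (any letter case).
dedupe_title_words() {
  awk '
    {
      low = tolower($0)
      s = index(low, "<title>")
      e = index(low, "</title>")
      if (s == 0 || e == 0 || e < s) { print; next }   # no one-line title here

      head = substr($0, 1, s + 6)            # everything up to and including <title>
      body = substr($0, s + 7, e - s - 7)    # the text between the two tags
      tail = substr($0, e)                   # </title> and the rest of the line

      n = split(body, w, " ")
      split("", seen)                        # reset the set of words already kept
      out = ""
      for (i = 1; i <= n; i++) {
        key = w[i]
        gsub(/[^A-Za-z0-9]/, "", key)        # compare without punctuation
        if (key != "" && (key in seen)) continue
        seen[key] = 1
        out = out (out == "" ? "" : " ") w[i]
      }
      print head out tail
    }'
}
```

For example, `dedupe_title_words < page.html > page.new.html` (file names are placeholders) leaves every line outside the title untouched.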

sycamorex 07-22-2011 01:44 PM

Perhaps it'll be easier if you post all the lines containing <title>. I know it'll be over 700 lines, but it'll give us a better picture:
Code:

grep -h "<title>" *.html
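Since grep is case-sensitive by default, the command above only finds lowercase <title> tags. A variant with -i would also catch <TITLE>; the sketch below wraps it in a made-up function, list_titles, that takes the directory as its argument.

```shell
# Hypothetical helper: print every line containing a title tag, in any letter
# case, from the .html files in the given directory (-h omits the file names).
list_titles() {
  grep -hi '<title>' "$1"/*.html
}
```

Usage would be something like `list_titles .` from the directory holding the pages.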

guessity 07-22-2011 01:55 PM

I will post a few. Should I post all of it?

Code:

<TITLE>Jaggery Dosa, Jaggery Dosa Recipe and Jaggery Dosa Guide</TITLE>
  <TITLE>Jaggery Lemon Ginger Juice, Jaggery Lemon Ginger Juice Recipe and Jaggery Lemon Ginger Juice Guide</TITLE>
  <TITLE>Jalebies, Jalebies Recipe and Jalebies Guide</TITLE>
  <TITLE>Jam Biscuits, Jam Biscuits Recipe and Jam Biscuits Guide</TITLE>
  <TITLE>Jeera Rice, Jeera Rice Recipe and Jeera Rice Guide</TITLE>
  <TITLE>Jhunka, Jhunka Recipe and Jhunka Guide</TITLE>
  <TITLE>Kachori, Kachori Recipe and Kachori Guide</TITLE>
  <TITLE>Kadabus, Kadabus Recipe and Kadabus Guide</TITLE>
  <TITLE>Kadagu Puli, Kadagu Puli Recipe and Kadagu Puli Guide</TITLE>
  <TITLE>Kadamba Chutney, Kadamba Chutney Recipe and Kadamba Chutney Guide</TITLE>
  <TITLE>Kadi, Kadi Recipe and Kadi Guide</TITLE>
  <TITLE>Kaju Curd Delight, Kaju Curd Delight Recipe and Kaju Curd Delight Guide</TITLE>
  <TITLE>Kakra, Kakra Recipe and Kakra Guide</TITLE>
  <TITLE>Kalakand, Kalakand Recipe and Kalakand Guide</TITLE>
  <TITLE>Kancheepuram Idli, Kancheepuram Idli Recipe and Kancheepuram Idli Guide</TITLE>
  <TITLE>Kanji Vada, Kanji Vada Recipe and Kanji Vada Guide</TITLE>
  <TITLE>Karanji, Karanji Recipe and Karanji Guide</TITLE>
  <TITLE>Karela, Karela Recipe and Karela Guide</TITLE>
  <TITLE>Karuvaeppilai Kuzhambu, Karuvaeppilai Kuzhambu Recipe and Karuvaeppilai Kuzhambu Guide</TITLE>
  <TITLE>Kasuri Paneer, Kasuri Paneer Recipe and Kasuri Paneer Guide</TITLE>
  <TITLE>Kathirikkai Gothsu, Kathirikkai Gothsu Recipe and Kathirikkai Gothsu Guide</TITLE>
  <TITLE>Kesar Jalebi, Kesar Jalebi Recipe and Kesar Jalebi Guide</TITLE>
  <TITLE>Kesar Pista Kulfi, Kesar Pista Kulfi Recipe and Kesar Pista Kulfi Guide</TITLE>
  <TITLE>Khajoor Chutney, Khajoor Chutney Recipe and Khajoor Chutney Guide</TITLE>
  <TITLE>Khaman Dhokla, Khaman Dhokla Recipe and Khaman Dhokla Guide</TITLE>
  <TITLE>Khandvi, Khandvi Recipe and Khandvi Guide</TITLE>
  <TITLE>Khara Pongal, Khara Pongal Recipe and Khara Pongal Guide</TITLE>
  <TITLE>Khatta Aloo, Khatta Aloo Recipe and Khatta Aloo Guide</TITLE>
  <TITLE>Kobbari Gojju, Kobbari Gojju Recipe and Kobbari Gojju Guide</TITLE>
  <TITLE>Kodubale, Kodubale Recipe and Kodubale Guide</TITLE>
  <TITLE>Kofta Curry, Kofta Curry Recipe and Kofta Curry Guide</TITLE>
  <TITLE>Kofta In Gravy, Kofta In Gravy Recipe and Kofta In Gravy Guide</TITLE>
  <TITLE>Kofta Tikka Masala, Kofta Tikka Masala Recipe and Kofta Tikka Masala Guide</TITLE>
  <TITLE>Koottu Curry, Koottu Curry Recipe and Koottu Curry Guide</TITLE>
  <TITLE>Kootu, Kootu Recipe and Kootu Guide</TITLE>


grail 07-22-2011 01:59 PM

I am not sure that a quick and dirty one liner will do it. As you are parsing html it would probably be best to use something like Perl.

However, if the pattern is always the same, the first word is always the one to be removed, and we don't have to worry about case sensitivity (i.e. the case will always be the same), something like this might work:
Code:

awk 'BEGIN{FS = "[, ]"}$1 ~ /title/{split($1,a,">");for(i=2;i<=NF;i++)if($i == a[2])$i = "";gsub(/ +/," ");$1=$1","}1' file
The difference from the sed offered is that if other words are present (i.e. not just Recipe and Guide), this one will handle them too.

I think sycamorex's suggestion above is still good though.

MTK358 07-22-2011 02:03 PM

Code:

sed -ri 's@<TITLE>([^,]+), \1 Recipe and \1 Guide</TITLE>@<TITLE>\1, Recipe and Guide</TITLE>@g' *.html
Also, you didn't mention the very important fact that the <TITLE> tags are uppercase (note that lowercase is the preferred style for HTML tags; AFAIK XHTML requires lowercase tags, while HTML itself treats tag names as case-insensitive).
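If the files mix <title> and <TITLE>, a variant that accepts either case could look like the sketch below. It is written as a stdin-to-stdout filter so the result can be inspected before anything is changed in place; the function name fix_titles is invented here, and it assumes the same "X, X Recipe and X Guide" pattern on a single line and a sed that accepts -E with back-references (GNU and BSD sed both do).

```shell
# Hypothetical filter: collapse "X, X Recipe and X Guide" to "X, Recipe and Guide"
# inside a title element written in any letter case, preserving the original case
# of the opening and closing tags.
fix_titles() {
  sed -E 's@<([Tt][Ii][Tt][Ll][Ee])>([^,<]+), \2 Recipe and \2 Guide</([Tt][Ii][Tt][Ll][Ee])>@<\1>\2, Recipe and Guide</\3>@'
}
```

Running `fix_titles < page.html | grep -i '<title>'` (page.html being a placeholder) shows the rewritten title without touching the file.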

guessity 07-22-2011 02:10 PM

MTK358, grail and sycamorex, I was able to fix it. I appreciate all your help.


All times are GMT -5.