LinuxQuestions.org
Old 07-22-2011, 10:15 AM   #1
guessity
Member
 
Registered: Dec 2009
Posts: 41

Rep: Reputation: 15
Question How to remove duplicate words within a particular text in a file?


I am trying to remove duplicate words from my <title></title> tags after my site got hit by Google Panda. I have around 750 .html files, and it would be difficult for me to fix them one by one. I am looking for a way to remove the duplicates only from within <title> </title>.

Example of a duplicate title I have:
Code:
<title>Pasta, Pasta Recipe and Pasta Guide</title>
Can someone suggest a way to do it? I don't want to replace those words anywhere else in the file, only within the <title>.
 
Old 07-22-2011, 11:13 AM   #2
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 721
Assuming you want to keep the tags and remove the text inside them:

Code:
sed -i 's@<title>Pasta, Pasta Recipe and Pasta Guide</title>@<title></title>@' *.html
Note that it's probably a good idea to back up the HTML files, because you could easily accidentally ruin them with commands like this.
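One way to get that backup for free is to let sed keep a copy of each file it edits. The following is only a sketch, assuming GNU sed; the scratch directory and the sample file pasta.html are made up for illustration, and the substitution is the same one as above:

```shell
# Scratch directory with one sample file, so no real file is touched.
workdir=$(mktemp -d)
printf '%s\n' '<title>Pasta, Pasta Recipe and Pasta Guide</title>' > "$workdir/pasta.html"

# -i.bak: edit in place, but keep each original next to it as <name>.bak
sed -i.bak 's@<title>Pasta, Pasta Recipe and Pasta Guide</title>@<title></title>@' "$workdir"/*.html

# Review the change. diff exits non-zero when the files differ, hence "|| true".
diff "$workdir/pasta.html.bak" "$workdir/pasta.html" || true

# To undo the edit: mv "$workdir/pasta.html.bak" "$workdir/pasta.html"
```

Once the results look right, the .bak files can simply be deleted.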
 
Old 07-22-2011, 11:24 AM   #3
guessity
Member
 
Registered: Dec 2009
Posts: 41

Original Poster
Rep: Reputation: 15
Thanks for your reply. I just want Pasta to appear once in the <title>, so I want to remove the other two Pasta references from my title. My files follow a similar pattern for other dishes too.
 
Old 07-22-2011, 12:22 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,565

Rep: Reputation: 2901
So then the question is ... which Pasta to keep? I.e. show us what the output should be after running the command on the example you have provided.
 
Old 07-22-2011, 01:53 PM   #5
guessity
Member
 
Registered: Dec 2009
Posts: 41

Original Poster
Rep: Reputation: 15
Code:
<title>Pasta, Recipe and Guide</title>
I am looking for this output.
 
Old 07-22-2011, 02:04 PM   #6
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 721
Is "pasta" the only word? Or are there other ones?

Do you want to make it so that there's only one of each word (extremely complicated to do), or only one of the word "pasta" (very simple to do)?

For example, if you had an HTML file with this title:

Code:
dsfas fgryr sdfgamrtf test gasdfig test fsgksdfg
Do you want it unchanged, or do you want this output:

Code:
dsfas fgryr sdfgamrtf test gasdfig fsgksdfg

Last edited by MTK358; 07-22-2011 at 02:07 PM.
 
Old 07-22-2011, 02:14 PM   #7
guessity
Member
 
Registered: Dec 2009
Posts: 41

Original Poster
Rep: Reputation: 15
Actually, I have 700 files for various dishes, so Pasta was just an example. The rest of the words remain the same. I just want to remove the dish names that are repeated.

Is it possible to remove words from the comma up to </title>?
 
Old 07-22-2011, 02:19 PM   #8
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,825
Blog Entries: 1

Rep: Reputation: 1221
Try the following:
Code:
sed 's/<title>\([A-Z][a-z]*\), \1 \([A-Z][a-z]*\) and \1 \([A-Z][a-z]*\)<\/title>/<title>\1, \2 and \3<\/title>/g' file

Code:
sycamorex@mordor:~/temp$ cat file
<title>Pasta, Pasta Recipe and Pasta Guide</title>
<title>Soup, Soup Recipe and Soup Guide</title>
<title>Beans, Beans Recipe and Beans Guide</title>
<title>Cakes, Cakes Recipe and Cakes Guide</title>
sycamorex@mordor:~/temp$ sed 's/<title>\([A-Z][a-z]*\), \1 \([A-Z][a-z]*\) and \1 \([A-Z][a-z]*\)<\/title>/<title>\1, \2 and \3<\/title>/g' file
<title>Pasta, Recipe and Guide</title>
<title>Soup, Recipe and Guide</title>
<title>Beans, Recipe and Guide</title>
<title>Cakes, Recipe and Guide</title>

Last edited by sycamorex; 07-22-2011 at 02:21 PM.
 
1 members found this post helpful.
Old 07-22-2011, 02:23 PM   #9
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 721
Quote:
Originally Posted by guessity View Post
Actually, I have 700 files for various dishes, so Pasta was just an example. The rest of the words remain the same. I just want to remove the dish names that are repeated.

Is it possible to remove words from the comma up to </title>?
So for every HTML file, check if the title contains more than one instance of a certain word, and if so, remove all but the first instance of that word. Is that correct?
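
For what it's worth, that rule (keep the first instance of each word in the title, drop the later ones) can be sketched with awk. This is a rough illustration, not a tested solution for the real files: the function name dedupe_title_words is made up, it assumes the opening and closing title tags sit on the same line, it compares words case-insensitively with punctuation ignored, and it would also drop legitimately repeated words such as a second "and".

```shell
# Hypothetical helper: reads HTML on stdin, writes it to stdout with
# repeated words removed from the text between <title> and </title>.
dedupe_title_words() {
    awk '{
        lower = tolower($0)
        start = index(lower, "<title>")
        end   = index(lower, "</title>")
        # Lines without a complete title are passed through untouched.
        if (start == 0 || end == 0 || end < start) { print; next }

        head = substr($0, 1, start + 6)                # up to and including <title>
        body = substr($0, start + 7, end - start - 7)  # text between the tags
        tail = substr($0, end)                         # </title> and whatever follows

        n = split(body, words, " ")
        split("", seen)                                # reset the seen-words table
        out = ""
        for (i = 1; i <= n; i++) {
            key = tolower(words[i])
            gsub(/[^a-z0-9]/, "", key)                 # "Pasta," and "Pasta" compare equal
            if (key != "" && (key in seen)) continue   # later instance: drop it
            seen[key] = 1
            out = out (out == "" ? "" : " ") words[i]
        }
        print head out tail
    }'
}

printf '%s\n' '<title>Pasta, Pasta Recipe and Pasta Guide</title>' | dedupe_title_words
# prints: <title>Pasta, Recipe and Guide</title>
```

Since it only reads stdin and writes stdout, it can be tried on a single file and the result compared with diff before anything is overwritten.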
 
Old 07-22-2011, 02:30 PM   #10
guessity
Member
 
Registered: Dec 2009
Posts: 41

Original Poster
Rep: Reputation: 15
Thumbs up

Quote:
Originally Posted by MTK358 View Post
So for every HTML file, check if the title contains more than one instance of a certain word, and if so, remove all but the first instance of that word. Is that correct?
Yes.
 
Old 07-22-2011, 02:44 PM   #11
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,825
Blog Entries: 1

Rep: Reputation: 1221
Perhaps it'll be easier if you post all the lines containing the <title> tag. I know it'll be over 700 lines, but it'll give us a better picture:
Code:
grep -h "<title>" *.html
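If 700 lines is too much to read through, a variation on the grep above could list only the titles that do not follow the "X, X Recipe and X Guide" pattern. This is just a sketch: the function name titles_not_matching is made up, and -i is used on the assumption that the case of the tags is not known in advance.

```shell
# Hypothetical helper: print the title lines from the given files that
# do NOT look like "<title>X, X Recipe and X Guide</title>".
titles_not_matching() {
    grep -hi '<title>' "$@" |
        grep -iv '<title>\([^,][^,]*\), \1 Recipe and \1 Guide</title>'
}

# Usage: titles_not_matching *.html
```

No output would mean every title fits the pattern, so a single sed expression should be able to cover all of the files.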
 
Old 07-22-2011, 02:55 PM   #12
guessity
Member
 
Registered: Dec 2009
Posts: 41

Original Poster
Rep: Reputation: 15
I will post a few. Should I post all of it?

Code:
 <TITLE>Jaggery Dosa, Jaggery Dosa Recipe and Jaggery Dosa Guide</TITLE>
   <TITLE>Jaggery Lemon Ginger Juice, Jaggery Lemon Ginger Juice Recipe and Jaggery Lemon Ginger Juice Guide</TITLE>
   <TITLE>Jalebies, Jalebies Recipe and Jalebies Guide</TITLE>
   <TITLE>Jam Biscuits, Jam Biscuits Recipe and Jam Biscuits Guide</TITLE>
   <TITLE>Jeera Rice, Jeera Rice Recipe and Jeera Rice Guide</TITLE>
   <TITLE>Jhunka, Jhunka Recipe and Jhunka Guide</TITLE>
   <TITLE>Kachori, Kachori Recipe and Kachori Guide</TITLE>
   <TITLE>Kadabus, Kadabus Recipe and Kadabus Guide</TITLE>
   <TITLE>Kadagu Puli, Kadagu Puli Recipe and Kadagu Puli Guide</TITLE>
   <TITLE>Kadamba Chutney, Kadamba Chutney Recipe and Kadamba Chutney Guide</TITLE>
   <TITLE>Kadi, Kadi Recipe and Kadi Guide</TITLE>
   <TITLE>Kaju Curd Delight, Kaju Curd Delight Recipe and Kaju Curd Delight Guide</TITLE>
   <TITLE>Kakra, Kakra Recipe and Kakra Guide</TITLE>
   <TITLE>Kalakand, Kalakand Recipe and Kalakand Guide</TITLE>
   <TITLE>Kancheepuram Idli, Kancheepuram Idli Recipe and Kancheepuram Idli Guide</TITLE>
   <TITLE>Kanji Vada, Kanji Vada Recipe and Kanji Vada Guide</TITLE>
   <TITLE>Karanji, Karanji Recipe and Karanji Guide</TITLE>
   <TITLE>Karela, Karela Recipe and Karela Guide</TITLE>
   <TITLE>Karuvaeppilai Kuzhambu, Karuvaeppilai Kuzhambu Recipe and Karuvaeppilai Kuzhambu Guide</TITLE>
   <TITLE>Kasuri Paneer, Kasuri Paneer Recipe and Kasuri Paneer Guide</TITLE>
   <TITLE>Kathirikkai Gothsu, Kathirikkai Gothsu Recipe and Kathirikkai Gothsu Guide</TITLE>
   <TITLE>Kesar Jalebi, Kesar Jalebi Recipe and Kesar Jalebi Guide</TITLE>
   <TITLE>Kesar Pista Kulfi, Kesar Pista Kulfi Recipe and Kesar Pista Kulfi Guide</TITLE>
   <TITLE>Khajoor Chutney, Khajoor Chutney Recipe and Khajoor Chutney Guide</TITLE>
   <TITLE>Khaman Dhokla, Khaman Dhokla Recipe and Khaman Dhokla Guide</TITLE>
   <TITLE>Khandvi, Khandvi Recipe and Khandvi Guide</TITLE>
   <TITLE>Khara Pongal, Khara Pongal Recipe and Khara Pongal Guide</TITLE>
   <TITLE>Khatta Aloo, Khatta Aloo Recipe and Khatta Aloo Guide</TITLE>
   <TITLE>Kobbari Gojju, Kobbari Gojju Recipe and Kobbari Gojju Guide</TITLE>
   <TITLE>Kodubale, Kodubale Recipe and Kodubale Guide</TITLE>
   <TITLE>Kofta Curry, Kofta Curry Recipe and Kofta Curry Guide</TITLE>
   <TITLE>Kofta In Gravy, Kofta In Gravy Recipe and Kofta In Gravy Guide</TITLE>
   <TITLE>Kofta Tikka Masala, Kofta Tikka Masala Recipe and Kofta Tikka Masala Guide</TITLE>
   <TITLE>Koottu Curry, Koottu Curry Recipe and Koottu Curry Guide</TITLE>
   <TITLE>Kootu, Kootu Recipe and Kootu Guide</TITLE>
 
Old 07-22-2011, 02:59 PM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,565

Rep: Reputation: 2901
I am not sure that a quick and dirty one liner will do it. As you are parsing html it would probably be best to use something like Perl.

However, if the pattern is always the same, the first word is always the one to be removed, and we don't have to worry about case sensitivity (i.e. the case will always be the same), something like this might work:
Code:
awk 'BEGIN{FS = "[, ]"}$1 ~ /title/{split($1,a,">");for(i=2;i<=NF;i++)if($i == a[2])$i = "";gsub(/ +/," ");$1=$1","}1' file
The difference from the sed offered above is that if other words are present (i.e. not just Recipe and Guide), this one will handle them too.

I think sycamorex's suggestion above is still good though.
 
1 members found this post helpful.
Old 07-22-2011, 03:03 PM   #14
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 721
Code:
sed -ri 's@<TITLE>([^,]+), \1 Recipe and \1 Guide</TITLE>@<TITLE>\1, Recipe and Guide</TITLE>@g' *.html
Also, you didn't mention the very important fact that the <TITLE> tags are uppercase (note that lowercase is the usual convention for HTML tags and that XHTML requires lowercase tag names, although HTML itself treats tag names case-insensitively).
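
If the files turn out to mix uppercase and lowercase tags, a case-insensitive variant of the same idea might look like the sketch below. It assumes GNU sed (the I flag on the s command is a GNU extension), the function name fix_title is made up, and it only filters stdin, so nothing is changed on disk until the output has been checked.

```shell
# Hypothetical helper: rewrite "X, X Recipe and X Guide" titles read from
# stdin, keeping whatever case the original tag used.
fix_title() {
    # Group 1 captures the tag name as written, group 2 the dish name.
    sed -E 's@<(title)>([^,]+), \2 Recipe and \2 Guide</title>@<\1>\2, Recipe and Guide</\1>@I'
}

printf '%s\n' '<TITLE>Jeera Rice, Jeera Rice Recipe and Jeera Rice Guide</TITLE>' | fix_title
# prints: <TITLE>Jeera Rice, Recipe and Guide</TITLE>
printf '%s\n' '<title>Pasta, Pasta Recipe and Pasta Guide</title>' | fix_title
# prints: <title>Pasta, Recipe and Guide</title>
```

Because [^,]+ matches everything up to the first comma, multi-word dish names are handled the same way as single-word ones.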

Last edited by MTK358; 07-22-2011 at 03:05 PM.
 
1 members found this post helpful.
Old 07-22-2011, 03:10 PM   #15
guessity
Member
 
Registered: Dec 2009
Posts: 41

Original Poster
Rep: Reputation: 15
MTK358, grail and sycamorex, I was able to fix it. I appreciate all your help.
 
  

