LinuxQuestions.org
Old 01-30-2013, 12:16 PM   #1
mdart
LQ Newbie
 
Registered: Jan 2013
Location: Brazil
Posts: 5

Rep: Reputation: Disabled
Difficulty cleaning references to duplicated images in HTML code


Hi,

I need to search and replace references to duplicated images in HTML code. There are several groups of duplicated images, which are visually the same but have different filenames. I managed to find the duplicated files themselves, but now I need to clean up the code too. I have a CSV file that lists the duplicated images by group:

Code:
Group ID,Duplicated image filename, Number of duplicates
0,13429.png,3 
0,18064.png,3
0,25025.png,3
1,14136.png,4
1,17382.png,4
1,19243.png,4
1,25389.png,4
2,21560.png,2
2,5529.png,2
3,3523.png,2
3,4811.png,2
and so on...

The references to duplicated images are scattered throughout hundreds of HTML files. The task is to make every <img> tag that references a duplicate point to a single image in each group. I'm wondering if some script magic could get it done easily.

HTML (before): different files, same visual appearance
Code:
<!-- group 0 -->
<img src="13429.png" />...text...<img src="18064.png" />...text...<img src="18064.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="17382.png" />...text...<img src="19243.png" />...text...<img src="25389.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="5529.png" />

HTML (after): unique file in each group
Code:
<!-- group 0 -->
<img src="13429.png" />...text...<img src="13429.png" />...text...<img src="13429.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="21560.png" />
I searched for solutions here in the forum, with no success.

Any help you can give would be greatly appreciated.
 
Old 01-31-2013, 05:31 AM   #2
mdart
LQ Newbie
 
Registered: Jan 2013
Location: Brazil
Posts: 5

Original Poster
Rep: Reputation: Disabled
I managed to find a solution with awk:

Code:
awk -F, 'NR==FNR {Ar[$1]=Ar[$1](Ar[$1]?"|":"")$2;
                  if (!Rr[$1])Rr[$1]=$2; next}
         {for (i in Ar) gsub (Ar[i], Rr[i])}
         {print >> "results.html"}
        ' duplicates.csv *.html
This checks each HTML file against the groups in the CSV file, then writes all the results to a single new file.
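
For readers following along, here is a commented sketch of the same idea, not the exact command above. It rests on a few assumptions that are not in the post: the first CSV row is a header, every reference looks exactly like src="NAME" with double quotes and no directory part, and the sample pages page1.html and page2.html are made up. It replaces literal strings with index() and substr() instead of gsub(), because in a regex the dot in ".png" matches any character and "5529.png" would also match inside "15529.png".

```shell
#!/bin/sh
# Sketch only: sample data is created in a throwaway directory.
work=$(mktemp -d) && cd "$work" || exit 1

cat > duplicates.csv <<'EOF'
Group ID,Duplicated image filename,Number of duplicates
0,13429.png,3
0,18064.png,3
0,25025.png,3
2,21560.png,2
2,5529.png,2
EOF

# Hypothetical pages; 15529.png is NOT a duplicate and must stay untouched.
printf '%s\n' '<img src="18064.png" />...text...<img src="25025.png" />' > page1.html
printf '%s\n' '<img src="5529.png" />...text...<img src="15529.png" />' > page2.html

awk -F, '
    # Replace every literal occurrence of "from" in s with "to" (no regex).
    function replace_all(s, from, to,    out, i) {
        out = ""
        while ((i = index(s, from)) > 0) {
            out = out substr(s, 1, i - 1) to
            s = substr(s, i + length(from))
        }
        return out s
    }
    # First file (the CSV): the first name seen in a group is kept,
    # every later name in that group is mapped to it.
    NR == FNR {
        if (FNR == 1) next                      # assumed header row
        if ($1 in keep) target[$2] = keep[$1]
        else            keep[$1] = $2
        next
    }
    # Remaining files (the HTML): point duplicate references at the kept name.
    {
        for (dup in target)
            $0 = replace_all($0, "src=\"" dup "\"", "src=\"" target[dup] "\"")
        print
    }
' duplicates.csv page1.html page2.html > results.html

cat results.html
```

With the sample data, both references in page1.html end up pointing at 13429.png, and in page2.html only the 5529.png reference changes.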

But a new question arises: is there any way to save the results in separate files, following the original?

I'd appreciate it if someone could help me solve this new problem!
 
Old 02-03-2013, 10:59 AM   #3
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,396

Rep: Reputation: 814
Something like:
Code:
  {print >> ("result-" FILENAME) }
  # instead of print >> "results.html"
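
Some background on why this works: FILENAME holds the name of the input file awk is currently reading, and the parentheses make sure the string concatenation is evaluated before the redirection. The sketch below is a variant under my own assumptions, not the snippet above. It uses > instead of >> (in awk, > truncates a file only the first time it is opened during a run, so running the script twice does not pile up duplicate lines), and it calls close() whenever the input file changes, which may matter with hundreds of files on awk implementations that limit the number of open files. The names a.html and b.html are made up.

```shell
#!/bin/sh
# Sketch only: sample files are created in a throwaway directory.
work=$(mktemp -d) && cd "$work" || exit 1
printf 'one\ntwo\n' > a.html
printf 'three\n'    > b.html

# Copy each input file to result-<name>, one output file per input file.
split_copy() {
    awk '
        # New input file: close the previous output, pick the next one.
        FNR == 1 { if (out != "") close(out); out = "result-" FILENAME }
        { print > out }
    ' "$@"
}

split_copy a.html b.html
split_copy a.html b.html      # second run overwrites, does not append
wc -l result-a.html result-b.html
```

Note that this only works as written when the file names have no directory part, since "result-" is glued to the front of whatever FILENAME contains.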
 
Old 02-03-2013, 12:36 PM   #4
mdart
LQ Newbie
 
Registered: Jan 2013
Location: Brazil
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by ntubski View Post
Something like:
Code:
  {print >> ("result-" FILENAME) }
  # instead of print >> "results.html"
Thanks ntubski, this worked great.

Is there any way to delete the original files and remove the "result-" part in the new file's names, inside this same script? I'd like to replace the old files with the results.
 
Old 02-03-2013, 01:06 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,396

Rep: Reputation: 814
Just rename back to original:
Code:
awk '...{print >> ("result-" FILENAME) }...' duplicates.csv *.html
# Note: this assumes none of the original files had "result-" as a prefix.
for result in result-*.html ; do
    mv "$result" "${result#result-}"
done
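
The ${result#result-} part is a standard shell parameter expansion that strips the shortest matching prefix from the variable's value. A small dry-run sketch (the file name is made up) that prints the mv commands instead of running them, and skips the loop body when the glob matches nothing:

```shell
#!/bin/sh
# Sketch only: works in a throwaway directory.
work=$(mktemp -d) && cd "$work" || exit 1
: > result-index.html         # hypothetical output file from the awk step

plan=$(
    for result in result-*.html; do
        [ -e "$result" ] || continue      # glob matched nothing
        printf 'mv %s %s\n' "$result" "${result#result-}"
    done
)
echo "$plan"
```

Once the printed commands look right, the printf line can be swapped for the real mv.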
 
Old 02-03-2013, 01:14 PM   #6
mdart
LQ Newbie
 
Registered: Jan 2013
Location: Brazil
Posts: 5

Original Poster
Rep: Reputation: Disabled
Brilliant. Thanks again for your help! Here's the complete code:

Code:
awk -F, 'NR==FNR {Ar[$1]=Ar[$1](Ar[$1]?"|":"")$2;
     if (!Rr[$1])Rr[$1]=$2; next}
     {for (i in Ar) gsub (Ar[i], Rr[i])}
     {print >> ("result-" FILENAME) }
     ' dupes.csv *.xhtml

for result in result-*.xhtml ; do
    mv "$result" "${result#result-}"
done
 
Old 02-05-2013, 10:10 PM   #7
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,396

Rep: Reputation: 814
Quote:
Originally Posted by ntubski View Post
# Note: this assumes none of the original files had "result-" as a prefix.
It occurs to me that using a suffix instead of a prefix would avoid that issue:
Code:
awk '...{print >> (FILENAME ".new") }...' duplicates.csv *.html
for result in *.html.new ; do
    mv "$result" "${result%.new}"
done
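
One cautious way to wire the two steps together, offered as a sketch rather than as part of the thread's solution: rename only when awk exits successfully, so a failed run cannot overwrite the originals. The awk program here is a stand-in with a single hard-coded substitution, not the elided program from the post, and index.html is a made-up file.

```shell
#!/bin/sh
# Sketch only: works in a throwaway directory.
work=$(mktemp -d) && cd "$work" || exit 1
printf '%s\n' '<img src="18064.png" />' > index.html

# Stand-in for the real awk program: one hard-coded replacement.
if awk '{ gsub(/18064\.png/, "13429.png"); print > (FILENAME ".new") }' *.html
then
    for result in *.html.new; do
        [ -e "$result" ] || continue       # nothing was written
        mv "$result" "${result%.new}"      # strip the .new suffix
    done
else
    echo "awk failed; originals left untouched" >&2
fi

cat index.html
```

Here ${result%.new} is the suffix counterpart of the # expansion: it strips the shortest matching suffix instead of a prefix.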
 
Old 02-06-2013, 05:05 AM   #8
mdart
LQ Newbie
 
Registered: Jan 2013
Location: Brazil
Posts: 5

Original Poster
Rep: Reputation: Disabled
Nice improvement, ntubski, thanks a lot!
 
  


All times are GMT -5. The time now is 10:57 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration