[SOLVED] Difficulty cleaning references to duplicated images in HTML code

mdart · 01-30-2013, 12:16 PM

Hi,

I need to search and replace references to duplicated images in HTML code. There are several groups of duplicated images, which are visually identical but have different filenames. I managed to find the duplicated files themselves, but now I need to clean up the code too. I have a CSV file that lists the images in each group of duplicates:

Code:

Group ID,Duplicated image filename,Number of duplicates
0,13429.png,3
0,18064.png,3
0,25025.png,3
1,14136.png,4
1,17382.png,4
1,19243.png,4
1,25389.png,4
2,21560.png,2
2,5529.png,2
3,3523.png,2
3,4811.png,2

and so on...

The references to duplicated images are scattered throughout hundreds of HTML files. The task is to make every <img> tag that references a duplicate point to a single image in its group. I'm wondering if some script magic could get it done easily.

HTML (before): different files, same visual appearance

Code:

<!-- group 0 -->
<img src="13429.png" />...text...<img src="18064.png" />...text...<img src="18064.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="17382.png" />...text...<img src="19243.png" />...text...<img src="25389.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="5529.png" />

HTML (after): unique file in each group

Code:

<!-- group 0 -->
<img src="13429.png" />...text...<img src="13429.png" />...text...<img src="13429.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="21560.png" />

I searched for solutions here in the forum, with no success.

Any help you can give would be greatly appreciated.

mdart · 01-31-2013, 05:31 AM

I managed to find a solution with awk:

Code:

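awk -F, '# CSV pass: Ar[group] collects the names as "a|b|c", Rr[group] keeps the first name
         NR==FNR {Ar[$1]=Ar[$1](Ar[$1]?"|":"")$2;
                  if (!Rr[$1])Rr[$1]=$2; next}
         # HTML pass: replace any name in a group with the first name of that group
         {for (i in Ar) gsub (Ar[i], Rr[i])}
         {print >> "results.html"}
        ' duplicates.csv *.html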

This applies the replacements listed in the CSV file to each HTML file, then writes all of the results to a single new file.
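One caveat with the pattern built from the CSV: each filename is used as a regular expression, so `.` matches any character and an unanchored `5529.png` also matches inside `15529.png`. Below is a minimal sketch of one way to tighten this. It assumes every reference has the exact form `src="NAME"` with double quotes (as in the examples above) and that the CSV starts with a header line. The `demo` directory and the sample files are made up for illustration.

```shell
# Throwaway sample data (hypothetical; stands in for the real CSV and HTML files).
demo=$(mktemp -d)
printf '%s\n' 'Group ID,Duplicated image filename,Number of duplicates' \
    '2,21560.png,2' '2,5529.png,2' > "$demo/duplicates.csv"
printf '%s\n' '<img src="5529.png" /> <img src="15529.png" /> <img src="21560.png" />' \
    > "$demo/page.html"

awk -F, '
    # CSV pass: skip the header, then build one pattern per group.
    NR == FNR {
        if (FNR == 1) next
        name = $2
        gsub(/\./, "[.]", name)              # match a literal dot, not any character
        Ar[$1] = Ar[$1] (Ar[$1] ? "|" : "") name
        if (!($1 in Rr)) Rr[$1] = $2         # first name in the group is the keeper
        next
    }
    # HTML pass: only touch names that fill a whole src="..." attribute.
    {
        for (i in Ar)
            gsub("src=\"(" Ar[i] ")\"", "src=\"" Rr[i] "\"")
        print
    }
' "$demo/duplicates.csv" "$demo/page.html" > "$demo/page.fixed"

cat "$demo/page.fixed"
```

In this sample the `5529.png` reference is rewritten to `21560.png`, while the unrelated `15529.png` is left alone. References written with single quotes or with a directory in front of the filename would need a different pattern.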

But a new question arises: is there any way to save the results in separate files, one per original file?

I'd appreciate it if someone could help me solve this new problem!

ntubski · 02-03-2013, 10:59 AM

Something like:

Code:

  {print >> ("result-" FILENAME) }
  # instead of print >> "results.html"
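Two things may be worth watching with per-file output. `>>` appends to whatever is already on disk, so running the script a second time doubles the contents of each result file. Also, every distinct output name stays open until awk exits, and with hundreds of HTML files some awk implementations can run out of file handles. The sketch below closes each output file when the next input file starts and uses `>` instead. The `out` variable, the `demo` directory, and the sample files are assumptions made for illustration.

```shell
# Throwaway sample data (hypothetical).
demo=$(mktemp -d)
printf '%s\n' 'Group ID,Duplicated image filename,Number of duplicates' \
    '0,13429.png,3' '0,18064.png,3' '0,25025.png,3' > "$demo/duplicates.csv"
printf '%s\n' '<img src="18064.png" />' > "$demo/a.html"
printf '%s\n' '<img src="25025.png" />' > "$demo/b.html"

# Run from inside the directory so FILENAME has no directory part in front of it.
(
    cd "$demo" || exit 1
    awk -F, '
        NR == FNR { Ar[$1] = Ar[$1] (Ar[$1] ? "|" : "") $2
                    if (!Rr[$1]) Rr[$1] = $2
                    next }
        FNR == 1  { if (out) close(out)      # finished with the previous output file
                    out = "result-" FILENAME }
                  { for (i in Ar) gsub(Ar[i], Rr[i])
                    print > out }            # ">" truncates on first open, then appends
    ' duplicates.csv a.html b.html
)

cat "$demo/result-a.html" "$demo/result-b.html"
```

With `>`, awk empties the file the first time it opens it during a run and keeps adding lines after that, so rerunning the script starts each result file from scratch.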

mdart · 02-03-2013, 12:36 PM

Quote:

Originally Posted by ntubski

Something like:

Code:

  {print >> ("result-" FILENAME) }
  # instead of print >> "results.html"

Thanks ntubski, this worked great.

Is there any way to delete the original files and remove the "result-" part from the new files' names, inside this same script? I'd like to replace the old files with the results.

ntubski · 02-03-2013, 01:06 PM

Just rename back to original:

Code:

awk '...{print >> ("result-" FILENAME) }...' duplicates.csv *.html
# Note: this assumes none of the original files had "result-" as a prefix.
for result in result-*.html ; do
    mv "$result" "${result#result-}"
done
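For anyone unfamiliar with the expansion used in the loop: `${result#result-}` removes one leading `result-` from the value, if present. A small sketch with made-up filenames (`strip_prefix` is a hypothetical helper, not part of the script above):

```shell
# strip_prefix is a hypothetical helper, used here only to show the expansion.
strip_prefix() {
    printf '%s\n' "${1#result-}"
}

strip_prefix result-index.html          # prints index.html
strip_prefix result-result-notes.html   # prints result-notes.html (only one prefix removed)
strip_prefix index.html                 # prints index.html (no prefix, unchanged)
```

This is also why the note about the prefix matters: an original file that is already named `result-something.html` matches the `result-*.html` glob and would be renamed along with the real results.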

mdart · 02-03-2013, 01:14 PM

Brilliant. Thanks again for your help! Here's the complete code:

Code:

awk -F, 'NR==FNR {Ar[$1]=Ar[$1](Ar[$1]?"|":"")$2;
     if (!Rr[$1])Rr[$1]=$2; next}
     {for (i in Ar) gsub (Ar[i], Rr[i])}
     {print >> ("result-" FILENAME) }
     ' dupes.csv *.xhtml

for result in result-*.xhtml ; do
    mv "$result" "${result#result-}"
done

ntubski · 02-05-2013, 10:10 PM

Quote:

Originally Posted by ntubski

# Note: this assumes none of the original files had "result-" as a prefix.

It occurs to me that using a suffix instead of a prefix would avoid that issue:

Code:

awk '...{print >> (FILENAME ".new") }...' duplicates.csv *.html
for result in *.html.new ; do
    mv "$result" "${result%.new}"
done
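The suffix form has a second advantage: it still works when the HTML files are given with a directory part, because `FILENAME ".new"` is still a valid path, whereas a `result-` prefix would end up in front of the directory name. Here is a sketch of the whole round trip under that assumption, with a guard for the case where the glob matches nothing. The `demo` directory and the sample files are made up for illustration.

```shell
# Throwaway sample data (hypothetical).
demo=$(mktemp -d)
printf '%s\n' 'Group ID,Duplicated image filename,Number of duplicates' \
    '3,3523.png,2' '3,4811.png,2' > "$demo/duplicates.csv"
printf '%s\n' '<img src="4811.png" />' > "$demo/page.html"

awk -F, '
    NR == FNR { Ar[$1] = Ar[$1] (Ar[$1] ? "|" : "") $2
                if (!Rr[$1]) Rr[$1] = $2
                next }
              { for (i in Ar) gsub(Ar[i], Rr[i])
                print > (FILENAME ".new") }  # e.g. /tmp/xyz/page.html.new
' "$demo/duplicates.csv" "$demo"/*.html

for result in "$demo"/*.html.new; do
    [ -e "$result" ] || continue             # glob matched nothing; skip the literal pattern
    mv "$result" "${result%.new}"            # page.html.new replaces page.html
done

cat "$demo/page.html"
```

Because the originals are overwritten by the `mv`, it seems prudent to try this on a copy of the files first.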

mdart · 02-06-2013, 05:05 AM

Nice improvement, ntubski, thanks a lot!