[SOLVED] how to change the content of lines in an html file using regex/grep

Firerat · 10-18-2019, 02:03 PM

which also takes us back to @pan64's mention of wget mirror mode

toothwright · 10-21-2019, 05:35 AM

@Firerat and scasey

I would argue that the reason for needing local copies is evident, though, in fairness, maybe not to amateur musicians who don't play for dances.

Musicians playing for folk dancing often used to need to carry bags of sheet music so that they could comply with various callers' requirements for tune sets for dances.

For many, this is no longer necessary, since digital scores are more convenient.

So, the answer to 'Why?' is that you do not usually have a reliable internet connection when you are sitting on a farmer's wagon in a barn playing for a dance.

scasey · 10-21-2019, 05:53 AM

Quote:

Originally Posted by toothwright

@scasey and others

I would argue that the reason is evident, though in fairness maybe not to non-dance-playing amateur musicians.

Musicians playing for folk dancing often used to carry bags of sheet music so that they could comply with the callers' wishes.
This is not now necessary; digital scores are more convenient.

So, the answer to 'Why not leave it as it is?' is because it is not usual to have an internet connection when you are sitting on a farmer's wagon at a barn dance.

Aha! So, you have copied the scores to your laptop and now want to be able to display them. So why use a web page to do that? Just open them from the file browser with the PDF viewer. Mayhaps you’re working harder than you need to.

The musicians I know and play with are very comfortable with their three-ring binders that contain only lyrics. Most of them don’t even read music, and a couple are complete Luddites who wouldn’t know how to turn a laptop on...

toothwright · 10-21-2019, 06:07 AM

Also: many talented amateur dance musicians seem to have an almost infinite memory for scores; others require an android pad.

The database is not as straightforward as it appears. Some entries bring up whole sets, many others are links to individual tunes.

I like to use HTML to arrange the order that I want. The browser alone does not allow this personalisation.

pan64 · 10-21-2019, 06:10 AM

Hm. If I understand correctly, you want to edit that HTML? Browsers usually open pages in read-only mode, but obviously you can edit your own files (using an HTML editor).

scasey · 10-21-2019, 06:53 AM

Quote:

Originally Posted by toothwright

Also: many talented amateur dance musicians seem to have an almost infinite memory for scores; others require an android pad.

The database is not as straightforward as it appears. Some entries bring up whole sets, many others are links to individual tunes.

I like to use HTML to arrange the order that I want. The browser alone does not allow this personalisation.

OK. I get that.
But I don’t see how you would do that by changing “the content of lines in an html file using regexp...” I think you’ll just need to sit down and code it.
SciTE has a copy function: Ctrl+D will duplicate a line (or whatever is selected), then the copy can be edited to change the page/document being linked to.

Hopefully you’ve been working on that since you posted in #22 and are almost done...

toothwright · 10-21-2019, 06:59 AM

@pan64
Yes, that is exactly what I am doing (using Bluefish).

My original question came from a hope that the repetitive parts of the edit could be automated to reduce manual work.

It has proved difficult to analyse the HTML lines I need to change in order to develop a working regex, so I have returned to manual edits - 1000 lines to go!

pan64 · 10-21-2019, 07:15 AM

Again, wget can download a page and all the links contained on that page, store them on the local disk/pendrive, and also rewrite the links [on the downloaded page] to use the downloaded files.
https://superuser.com/questions/8006...asnt-specified

On the other hand, you only need a bulk search/replace to modify the original URL; you do not need to do it line by line.
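
A minimal sketch of such a bulk search/replace, assuming every link shares one URL prefix and ends in .htm. The host www.example.org, the Tunes/ directory and the file names are placeholders, not the real site:

```shell
# Hypothetical sample lines; www.example.org and the file names are placeholders.
sample='<li>Tune one <a href="http://www.example.org/mm/M0001_tune_one.htm">MM1</a></li>
<li>Tune two <a href="http://www.example.org/mm/M0002_tune_two.htm">MM2</a></li>'

# First expression: swap the remote URL prefix for a local directory.
# Second expression: change the extension, but only right before the closing quote.
bulk_out=$(printf '%s\n' "$sample" |
  sed -E -e 's@http://www\.example\.org/mm/@Tunes/@' -e 's@\.htm"@.pdf"@')

printf '%s\n' "$bulk_out"
```

Both expressions run on every line of the input, so the whole file is converted in one pass.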

ondoho · 10-22-2019, 04:00 AM

Are the files named in a searchable, recognizable manner?
Maybe you can simply use something like fzf (fuzzy find):

Code:

mupdf "$(find "local directory" -iname '*pdf'|fzf)"

toothwright · 10-22-2019, 10:46 AM

I started with an original line:

<li>Ashokan farewell<a href="http://www.xx.co.uk/mm/M0245_Ashokan_farewell.htm">MM245</a></li>

Used SciTE (as scasey suggested) to produce :

<li>Ashokan farewell <a href="Tunes/MM0245_Ashokan_farewell.pdf">MM245</a></li>

which is how the file looks now.

The final effort should look like this (done manually):

<li><a href="Tunes/MM0245_Ashokan_farewell.pdf">Ashokan farewell</a></li>

So, in the SciTE-processed line, for example, I would like to move the tune name "Ashokan farewell" into the link in place of "MM245".
The problem is that both items vary in length from line to line, and the names do not always use the UK font... this is why I'm stuck.
I'll explore fzf next, thank you for the example
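
For reference, a rough sketch of that swap on the SciTE-processed line. It assumes every line has exactly the `<li>name <a href="...">label</a></li>` shape shown above. Matching "anything up to the next <" instead of a fixed width copes with the varying lengths, and LC_ALL=C makes sed treat the input as plain bytes, which is a guess at what the font problem needs:

```shell
# The SciTE-processed line from the post.
line='<li>Ashokan farewell <a href="Tunes/MM0245_Ashokan_farewell.pdf">MM245</a></li>'

# \1 = tune name (any length, trailing space dropped), \2 = the opening <a ...> tag.
# The old link text (MM245) is matched by the last [^<]* and discarded.
swapped=$(printf '%s\n' "$line" |
  LC_ALL=C sed -E 's@<li>([^<]*[^< ]) *(<a href="[^"]*">)[^<]*</a></li>@<li>\2\1</a></li>@')

printf '%s\n' "$swapped"
```

Lines that do not fit the assumed shape are left unchanged, so they can be found and fixed by hand afterwards.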

Firerat · 10-22-2019, 11:30 AM

this

Code:

sed 's@\(<li>\)\(.*\+\)<a href="http.*/\(.*\+\).htm.*@\1<a href="Tunes/M\3.pdf">\2</a></li>@'

does what you want to

Code:

<li>Ashokan farewell<a href="http://www.xx.co.uk/mm/M0245_Ashokan_farewell.htm">MM245</a></li>

however, I did hardcode the extra M
more work would be required if MM needs to replace M

it is difficult to find a pattern in one example

I suppose it could be the UPPER of mm in co.uk/mm/

input

Code:

<ul>
<li>Beans <a href="http://www.anything.co.uk/Contra_reels.htm">MM339</a></li>
<li>Beaulieu <a href="http://www.anything.co.uk/French_Canadian_reels.htm">MM223</a></li>
<li>Beaumont rag <a href="http://www.anything.co.uk/Rags.htm">MM58</a></li>
<li>Bedd y morwr <a href="http://www.anything.co.uk/A_Welsh_medley.htm">MM321</a></li>
<li>Ashokan farewell<a href="http://www.xx.co.uk/mm/M0245_Ashokan_farewell.htm">MM245</a></li>
</ul>

the sed

Code:

<input sed 's@\(<li>\)\(.*\+\)<a href="http.*/\(.*\+\).htm.*@\1<a href="Tunes/M\3.pdf">\2</a></li>@'

output

Code:

<ul>
<li><a href="Tunes/MContra_reels.pdf">Beans </a></li>
<li><a href="Tunes/MFrench_Canadian_reels.pdf">Beaulieu </a></li>
<li><a href="Tunes/MRags.pdf">Beaumont rag </a></li>
<li><a href="Tunes/MA_Welsh_medley.pdf">Bedd y morwr </a></li>
<li><a href="Tunes/MM0245_Ashokan_farewell.pdf">Ashokan farewell</a></li>
</ul>

Firerat · 10-22-2019, 12:22 PM

there was mention of names being duplicate with variations ( the unique number )

so I pre-empted

Code:

sed 's@\(<li>\)\(.\+\)<a href="http.\+/[[:alpha:]]\+\([0-9]\+\)\(.\+\).htm.\+\([[:alpha:]]\{2\}\)\([0-9]\+\).\+@<li><a href="Tunes/\5\3\4.pdf">\5\3 \2</a></li>@'

this assumes MM is always two Alpha characters

Code:

<ul>
<li>Beans <a href="http://www.anything.co.uk/Contra_reels.htm">MM339</a></li>
<li>Beaulieu <a href="http://www.anything.co.uk/French_Canadian_reels.htm">MM223</a></li>
<li>Beaumont rag <a href="http://www.anything.co.uk/Rags.htm">MM58</a></li>
<li>Bedd y morwr <a href="http://www.anything.co.uk/A_Welsh_medley.htm">MM321</a></li>
<li><a href="Tunes/MM0245_Ashokan_farewell.pdf">MM0245 Ashokan farewell</a></li>
</ul>

notice that the early sample data is no longer touched

essentially the same

Code:

sed 's@\(<li>\)\(.\+\)\(<a href="\)http.\+/[[:alpha:]]\+\([0-9]\+\)\(.\+\).htm.\+\([[:alpha:]]\{2\}\)\([0-9]\+\)\(.\+\)@\1\3Tunes/\6\4\5.pdf">\6\4 \2\8@'

notice that \1 \2 \3 are the "chunks" wrapped in \(\)
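
A tiny illustration of those numbered chunks, on a made-up two-word input rather than the real HTML:

```shell
# Two chunks separated by a space; the replacement emits them in reverse order.
swapped_words=$(printf '%s\n' 'farewell Ashokan' | sed 's/\(.*\) \(.*\)/\2 \1/')

printf '%s\n' "$swapped_words"   # Ashokan farewell
```

The chunks are numbered by the position of their opening \( from left to right, which is how the longer commands above end up with \8 or \9.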

Edit3
and this one is more like the original

Code:

<input sed 's@\(<li>\)\(.\+\)\(<a href="\)http.\+/[[:alpha:]]\+\([0-9]\+\)\(.\+\).htm\(.\+\)\([[:alpha:]]\{2\}\)\([0-9]\+\)\(.\+\)@\1\2\3Tunes/\7\4\5.pdf\6\7\8\9@'

Code:

...
<li>Ashokan farewell<a href="Tunes/MM0245_Ashokan_farewell.pdf">MM245</a></li>
...

a 'simpler' ( shorter ) version

Code:

sed 's@http.\+/[[:alpha:]]\+\([0-9]\+\)\(.\+\).htm\(.\+\)\([[:alpha:]]\{2\}\)\([0-9]\+\)@Tunes/\4\1\2.pdf\3\4\5@'

Firerat · 10-22-2019, 03:54 PM

my escapes are a very bad habit

Code:

sed -E 's@(<li>)(.+)(<a href=")http.+/[[:alpha:]]+([0-9]+)(.+).htm">([[:alpha:]]{2})([0-9]+)(.+)@\1\3Tunes/\6\4\5.pdf">\6\4 \2\8@'

Code:

sed -E 's@http.+/[[:alpha:]]+([0-9]+)(.+).htm(">)([[:alpha:]]{2})([0-9]+)@Tunes/\4\1\2.pdf\3\4\5@'

technically the MM bit should be ([[:upper:]]{2})
if it is not always two, then
(">)([[:upper:]]+)([0-9]+)

so that is UpperCase letter 1 or more times and digit 1 or more times
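
Put together, the shorter command with that change would look roughly like this. The second input line, with its three-letter ABC prefix, is invented purely to show the "not always two" case, and the dot before htm is escaped here so it only matches a literal dot:

```shell
# First line is from the thread; the second is a made-up example with a longer prefix.
input='<li>Ashokan farewell<a href="http://www.xx.co.uk/mm/M0245_Ashokan_farewell.htm">MM245</a></li>
<li>Some tune<a href="http://www.xx.co.uk/mm/M0012_Some_tune.htm">ABC12</a></li>'

# \4 is now "one or more upper-case letters" instead of exactly two.
upper_out=$(printf '%s\n' "$input" |
  sed -E 's@http.+/[[:alpha:]]+([0-9]+)(.+)\.htm(">)([[:upper:]]+)([0-9]+)@Tunes/\4\1\2.pdf\3\4\5@')

printf '%s\n' "$upper_out"
```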

you may have noticed I corrected another bad habit, the use of *
that is the previous match zero or more times, and a lot of the time is used incorrectly

I don't use sed much these days, and I have still not got rid of the bad habits I picked up following early examples

? is the previous 0 or 1 times

it does seem very confusing, but once you get your head around it, it does make perfect sense.
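
A quick way to see the difference between the three is to try them on throwaway strings (these inputs are just for illustration):

```shell
# * allows zero repeats, so it matches the empty string at the very start of the line.
star=$(printf '%s\n' 'MM245' | sed -E 's/[0-9]*/X/')          # XMM245
# + needs at least one digit, so the match is the 245.
plus=$(printf '%s\n' 'MM245' | sed -E 's/[0-9]+/X/')          # MMX
# ? makes the u optional, so both spellings match.
opt=$(printf '%s\n' 'colour color' | sed -E 's/colou?r/X/g')  # X X

printf '%s\n' "$star" "$plus" "$opt"
```

The first result is the classic surprise with *: nothing is removed, because an empty match at the start of the line already satisfies the pattern.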

Firerat · 10-22-2019, 05:22 PM

and this fits the "original" original

Code:

curl secretsiteplaceholder/mm/sheets.htm | \
  sed -E '/sheet_list|^\//s@/mm/([[:upper:]]+[0-9]+.+).htm@Tunes/\1.pdf@' \
  > sheetmusic.htm

that is much cleaner

curl is probably not installed by default
you don't *need* it, just run the sed on local copy

edit: if the local copy has http link in it

Code:

sed -E '/sheet_list|^\//s@http.+/mm/([[:upper:]]+[0-9]+.+).htm@Tunes/\1.pdf@'

real sample

Code:

<p class="sheet_list_title"><a name="MM9001"><a href="/mm/MM09001_pipes_from_the_world_of_bash.htm">MM9001 <span class="sheet_title">pipes from the world of bash</span></a></a></p>
<p class="sheet_list_tunes">The Chelsea Flower Show
/ <span class="disabled">Choo Choo</span>
/ Acid Burns <a href="/mm/MM09002_wooden_shoe_antlerpipes.htm">MM9002</a></p>
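
As a sketch of what the command does to that sample: the leading /sheet_list|^\// is an address, so the substitution only runs on lines that contain sheet_list or start with a slash. The grep at the end is only there to show the rewritten links:

```shell
# The four sample lines from above, held in a variable instead of a file.
sample='<p class="sheet_list_title"><a name="MM9001"><a href="/mm/MM09001_pipes_from_the_world_of_bash.htm">MM9001 <span class="sheet_title">pipes from the world of bash</span></a></a></p>
<p class="sheet_list_tunes">The Chelsea Flower Show
/ <span class="disabled">Choo Choo</span>
/ Acid Burns <a href="/mm/MM09002_wooden_shoe_antlerpipes.htm">MM9002</a></p>'

# Address first, then the s command: /mm/NAME.htm becomes Tunes/NAME.pdf.
converted=$(printf '%s\n' "$sample" |
  sed -E '/sheet_list|^\//s@/mm/([[:upper:]]+[0-9]+.+).htm@Tunes/\1.pdf@')

# Print only the href attributes to check the result.
hrefs=$(printf '%s\n' "$converted" | grep -o 'href="[^"]*"')
printf '%s\n' "$hrefs"
```

The two middle lines match the address but contain no /mm/ link, so they pass through unchanged.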

toothwright · 10-23-2019, 06:45 PM

@Firerat
I should like to thank you for the analysis and programming suggestions.
It is very patient of you to try to lead me to a solution and I am trying to implement your technique.
I shall post when I make any progress.