LinuxQuestions.org - simple bash script editing

This is probably a simple thing, but my experience with bash is limited.

I want to automatically edit a bunch of HTML files when they are generated every month. Here are the basic criteria for the edits:

1. For every table row <TR> that has the string "Total Files" in it, delete that entire table row.

2. For every string found that has the line "Hostname", rename it to "Connections".

3. For every string that has "Top * of * Total URLs", replace it with "Databases". (The * are automatically generated numbers)

Quote:

Originally Posted by tekmann33 (Post 3219128)

Yes, it is a simple thing. Look at the man page for sed, it should give you what you need.

Personally, I would not try to learn SED from the man page. Go here for an excellent tutorial: http://www.grymoire.com/Unix/Sed.html

The problem you describe is non-trivial in SED, if the <TR> structure spans multiple lines. (SED works one line at a time.)

When you say "every string that...", it is ambiguous: you have to be able to define where the string starts and stops.

Look also at AWK; the Grymoire site has a good tutorial on that as well. In addition, you may want to look at "Bash Guide for Beginners" and "The Advanced Bash Scripting Guide", both free at http://tldp.org
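For the two plain substitutions in the question (items 2 and 3), something like the following may be enough. It is only a sketch: the function name relabel is made up for illustration, and it assumes that "Hostname" should be replaced wherever it appears and that the generated numbers in "Top * of * Total URLs" are plain runs of digits.

```shell
#!/bin/bash
# Hypothetical helper covering items 2 and 3 of the question.
# Assumptions: "Hostname" is matched literally anywhere on a line, and the
# generated numbers in "Top N of M Total URLs" are plain digit runs.
relabel() {
  sed -e 's/Hostname/Connections/g' \
      -e 's/Top [0-9][0-9]* of [0-9][0-9]* Total URLs/Databases/g' "$@"
}
```

It reads the named files (or standard input) and writes the edited text to standard output, so it can be checked on a copy before anything is overwritten.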

Processing HTML with sed is a bit tricky, because you can't really know how it will be formatted. sed is line-oriented. If you can be sure that your HTML will be something like this:

Code:

...

<tr><td>something</td><td>else</td></tr>

<tr><td>Total files</td><td>17</td></tr>

...

...then it's pretty easy - you can just drop the lines which start with <tr> and have "Total files" on them.
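A minimal sketch of that one-line case, assuming every row really does sit on a single line that begins with <tr>. The function name drop_total_lines is just for illustration, and the match is case-sensitive, so the text has to be spelled the way it appears in the files.

```shell
#!/bin/bash
# Hypothetical helper: delete every line that starts with <tr> and also
# contains "Total files". Only valid when each row occupies one line.
drop_total_lines() {
  sed '/^<tr>.*Total files/d' "$@"
}
```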

However, consider this:

Code:

...

<tr>

  <td>something</td>

  <td>else</td>

</tr>

<tr>

  <td>Total files</td>

  <td>17</td>

</tr>

...

It's the same thing as far as HTML is concerned, but a line-by-line sed filter won't cut it, because no single line contains the whole row to be filtered out.

For sure you can write a program with awk or perl which can do it, but it's annoyingly tricky for something so apparently simple. If you can remove the data before it is turned into HTML, it would probably be easier and more robust.

If not, you might want to consider using some HTML parsing library.
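To give an idea of what the awk route could look like, here is a rough sketch. It buffers each row from <tr> to </tr> and only prints the buffer if "Total files" is not in it. The function name drop_total_rows is invented for the example, and it assumes a bare <tr> tag (no attributes), no nested tables, and that two rows never share a line. Real-world HTML can easily break those assumptions.

```shell
#!/bin/bash
# Hypothetical helper: drop every <tr>...</tr> block containing "Total files".
# Assumes bare <tr> tags, no nested tables, and no two rows on the same line.
drop_total_rows() {
  awk '
    /<tr>/ { inrow = 1; buf = "" }        # a row starts: begin buffering
    inrow {
      buf = buf $0 "\n"                   # collect the lines of this row
      if ($0 ~ /<\/tr>/) {                # row is complete
        if (buf !~ /Total files/) printf "%s", buf
        inrow = 0
      }
      next
    }
    { print }                             # lines outside any row pass through
  ' "$@"
}
```

Because a row is only printed once its closing tag has been seen, the same function handles both the one-line and the multi-line layouts shown above.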

Hi.

Here is one way to delete a table row:

Code:

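#!/usr/bin/perl

# @(#) p1      Demonstrate delete of line-spanning HTML table row.

use warnings;
use strict;

# Read the whole input (file named on the command line, or stdin) into one string.
my ($entire) = slurp();

# Delete every row whose first cell is "Total files":
#   /s lets "." match newlines, so a row may span several lines;
#   /g removes all such rows, not just the first;
#   /i ignores case, so "Total Files" matches too.
$entire =~ s|<tr>\s*<td>Total files</td>.*?</tr>||gis;
print $entire;

sub slurp {

  # Best practices, p213 for a file.
  # With $/ undefined, <> returns the entire input in one read.
  my $scalar = do { local $/; <> };
  return $scalar;
}

exit(0);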

Driving this with a short shell script:

Code:

#!/bin/bash -

# @(#) s1      Demonstrate match across lines.

echo
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version =o $(_eat $0 $1) perl tidy
set -o nounset
echo

# Data file: first argument, defaulting to data1.
FILE=${1-data1}

echo " Data file $FILE:"
cat "$FILE"

echo
echo " Results:"
# Delete the row with p1, then let tidy re-indent the HTML quietly.
./p1 "$FILE" |
tidy -i -q

exit 0

To produce:

Code:

% ./s1



(Versions displayed with local utility "version")

Linux 2.6.11-x1

GNU bash 2.05b.0

perl 5.8.4

HTML Tidy for Linux/x86 released on 1st August 2004



 Data file data1:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">

<html>

<head>

<meta name="generator" content=

"HTML Tidy for Linux/x86 (vers 1st August 2004), see www.w3.org">

<title>Stuff</title>

</head>

<body>

<table summary = "This is what is in this table">

<tr>

<td>something</td>

<td>else</td>

</tr>

<tr>

<td>Total files</td>

<td>17</td>

</tr>

</table>

</body>

</html>



 Results:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">



<html>

<head>

  <meta name="generator" content=

  "HTML Tidy for Linux/x86 (vers 1st August 2004), see www.w3.org">



  <title>Stuff</title>

</head>



<body>

  <table summary="This is what is in this table">

    <tr>

      <td>something</td>



      <td>else</td>

    </tr>

  </table>

</body>

</html>

The entire file is read into a scalar, then the specific row is deleted, and whatever remains is written out.

The HTML was cleaned up on output with tidy.

See appropriate man pages for details ... cheers, makyo