LinuxQuestions.org
Old 07-30-2003, 04:56 PM   #1
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Rep: Reputation: Disabled
Catch-22 with 'uniq' - sed, awk, another way out?


Maybe this should go in Programming but I'll try it here. I've got a directory of html files. I'm able to

grep 00:00:00 * > output1.htm

to produce a file containing only the lines with 00:00:00 in them.

Then I can

sed -e 's/0.*A>//g' output1.htm > output2.txt

to remove everything up to a name because all files are named 0blabla and before every name is the closing tag of a hyperlink - the A> is enough to take care of everything up to and including the </A>.

Then I can simply

sort output2.txt -o output3.txt

and get the names sorted in order.

Problem is, some of the lines overlap - they appear at the bottom of one html file and the top of the next. So I thought of doing 'uniq' but it appears from the man page to only work on *successive* lines. And the only way I can make them successive is to *remove* what makes them different and sort them. But then virtually all the lines would be removed at that point. I need to do 'uniq' before 'sed' and 'sort'. But then they're not successive.

I tried 'sort -k4 output3.txt -o output4.txt' but that leaves '-k1' unsorted so they are still not successive. And, again, if I sort again by '-k1' that sorts them by date but no longer by name so they're *still* not successive.

Code:
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?63,0,800">Post Title</A> - digiot [ 6 <EM></EM> ]<BR>
I want to be left with a

- digiot [ 6 <EM></EM> ]<BR>

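that I can sort but with redundantly dated lines removed. (Obviously I'm not too picky about making it elegant.) I can actually do a 'sed -e 's/\[ .*BR>//g' outputN.txt > outputM.txt' to clean that up if I want (writing to a different file, since redirecting back onto the input file would empty it before sed reads it).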

But I just know I'm going about this wrong. It can't require this many steps. I read some sed threads and came across a reference to awk but didn't quite understand it. Do I need sed or awk or uniq or something else? (I barely even understand the sed line I use - I just came across it in a tutorial and saw that it was what I wanted... in part.) I do want to learn all this because... well, I love it. It's what it's all about. This stuff is *sooo* much cooler than compiling damn drivers. But this is kind of a project I need to do in a hurry, so I'm trying to fake it rather than really comprehend it all today. *g*

Any help would be muchly appreciated and will probably help even more when I do buckle down to grasp all this for real. Thanks.

Oh, and do all these tools have case-insensitivity switches and, if so, what are they? I just ran stuff through twice because some are </A> and some are </a>.

Ugh. Easier to do eyeballing it but I figure in the long run, automation would win. Besides, it's a learning experience!
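
On the case-insensitivity question, a rough sketch rather than a full answer: grep has -i, sort has -f (fold case), uniq has -i, and for sed a bracket expression such as [Aa] works in any sed (GNU sed also accepts an I flag on s///, but that is a GNU extension). The sample lines below are made up for illustration.

```shell
# Made-up sample lines: the closing tag appears as both </A> and </a>.
sample='x</A> - Tom
y</a> - tom
z</A> - Dick'

# grep -i  : match </a> in either case
# [Aa]     : portable way to make one sed character case-insensitive
# sort -f  : fold lower case to upper case while sorting
# uniq -ci : count repeats, ignoring case when comparing neighbours
counts=$(printf '%s\n' "$sample" |
    grep -i '</a>' |
    sed 's/^.*<\/[Aa]> - //' |
    sort -f |
    uniq -ci)

printf '%s\n' "$counts"
```

With these switches one pass is enough; "Tom" and "tom" end up counted together.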
 
Old 07-30-2003, 05:17 PM   #2
Blinker_Fluid
Member
 
Registered: Jul 2003
Location: Clinging to my guns and religion.
Posts: 683

Rep: Reputation: 63
I made 2 files to test this on... ran this command:
grep 00:00:00 * | sed -e 's/0.*A>//g' | sort
which gives me lines that look like this:
junk2: - digiot [ 6 <EM></EM> ]<BR>
junk2: - digiot [ 6 <EM></EM> ]<BR>
junk2: - digiot [ 6 <EM></EM> ]<BR>
junk3: - digiot [ 6 <EM></EM> ]<BR>
junk3: - digiot [ 6 <EM></EM> ]<BR>

now you want to remove the junk3: part?
sed 's/^..*- digiot/- digiot/g'

so complete command would be:
grep 00:00:00 * | sed -e 's/0.*A>//g' | sort | sed 's/^..*- digiot/- digiot/g'

which would give you output that looks like:
- digiot [ 6 <EM></EM> ]<BR>
- digiot [ 6 <EM></EM> ]<BR>
- digiot [ 6 <EM></EM> ]<BR>
- digiot [ 6 <EM></EM> ]<BR>
- digiot [ 6 <EM></EM> ]<BR>
- digiot [ 6 <EM></EM> ]<BR>
- digiot [ 6 <EM></EM> ]<BR>
 
Old 07-30-2003, 05:25 PM   #3
Blinker_Fluid
Member
 
Registered: Jul 2003
Location: Clinging to my guns and religion.
Posts: 683

Rep: Reputation: 63
Just a little help with sed
basic syntax:
sed 's/string_to_search_on/string_to_replace_it_with/g' filename
Sometimes the -e is needed; it's one of those "I'm not sure what it does" things...
so your script
sed -e 's/0.*A>//g' output1.htm
looks for 0 then everything to A> and replaces it with nothing. One problem you may have is when October comes around your file is going to start with 10/01/03 and then your line output will be:
1 - digiot [ 6 <EM></EM> ]<BR>

If you want everything from the front of the line use the ^ character (followed by .* to match whatever comes before the A>), so like this:
sed -e 's/^.*A>//g' filename(s)
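
A small sketch of the October case, with the anchored pattern written out as ^.*A> (the line below is a hypothetical October entry in the same shape as grep's output, not real data):

```shell
# Hypothetical October line, shaped like the grep output in this thread.
line='1001.html:10/01/03  00:00:00 <A HREF="thread.cgi?63,0,800">Post Title</A> - digiot [ 6 <EM></EM> ]<BR>'

# Starts at the first 0, so the leading "1" of the filename survives.
from_zero=$(printf '%s\n' "$line" | sed -e 's/0.*A>//')

# ^ anchors at the start of the line; .* then runs up to the last A>.
from_start=$(printf '%s\n' "$line" | sed -e 's/^.*A>//')

printf '[%s]\n[%s]\n' "$from_zero" "$from_start"
```

The anchored form does not care what the line starts with, so it should keep working whatever the month is.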

one last thing... if you are just looking to remove and sort duplicate lines a sort -u might be helpful...
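
A quick sketch of why sort -u gets around the "successive lines" limit of uniq (the three lines are made up, with the duplicates deliberately not adjacent):

```shell
# Made-up lines where the duplicates are not next to each other.
input=' - Tom [ 5 ]
 - Dick [ 6 ]
 - Tom [ 5 ]'

# uniq alone only compares neighbouring lines, so nothing is dropped here.
uniq_only=$(printf '%s\n' "$input" | uniq)

# sort -u sorts first and keeps one copy of each distinct line.
sorted_unique=$(printf '%s\n' "$input" | sort -u)

printf '%s\n' "$sorted_unique"
```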

If you need some more help maybe posting a few lines (change whatever you don't feel comfortable posting... ie IPs, usernames, etc)

Last edited by Blinker_Fluid; 07-30-2003 at 05:44 PM.
 
Old 07-31-2003, 05:56 AM   #4
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Original Poster
Rep: Reputation: Disabled
[I got hit by a couple of thoughts while writing this and tried them after, but nothing worked. I think I'm giving up on this and trying a different approach entirely, but I'm just going to post this anyway to let you know what was going on. But maybe none of this will be necessary if the other angle works.]


Hey, thanks so much for taking the time. Sorry - I wondered if I was making any sense and did a terrible job with that. I've got several files, two of which are something like this:

0706.html
Code:
06/07/03  00:01:01 <A HREF="thread.cgi?38,0,820">Whoops</A> - Joe Public [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?36,0,820">Wow</A> - Dick [ 6 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?35,0,820">Whee</A> - Harry [ 6 <EM></EM> ]<BR>
0707.html
Code:
07/07/03  00:00:00 <A HREF="thread.cgi?41,0,800">Whoah</A> - Tom [ 5 <EM></EM> ]<BR>
07/07/03  00:00:00 <A HREF="thread.cgi?40,0,800">Shebang!!</A> - Dick [ 8 <EM></EM> ]<BR>
07/07/03  00:00:00 <A HREF="thread.cgi?39,0,800">Bop - bop</A> - Harry [ 7 <EM></EM> ]<BR>
06/07/03  00:01:01 <A HREF="thread.cgi?38,0,800">Whoops</A> - Joe Public [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?37,0,800">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
and grep produces this:

Code:
0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?36,0,820">Wow</A> - Dick [ 6 <EM></EM> ]<BR>
0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?35,0,820">Whee</A> - Harry [ 6 <EM></EM> ]<BR>
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?41,0,800">Whoah</A> - Tom [ 5 <EM></EM> ]<BR>
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?40,0,800">Shebang!!</A> - Dick [ 8 <EM></EM> ]<BR>
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?39,0,800">Bop - bop</A> - Harry [ 7 <EM></EM> ]<BR>
0707.html:06/07/03  00:00:00 <A HREF="thread.cgi?37,0,800">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
where I don't want Joe Public's posts included (which works) but I only want one of Tom's "Huh" threads (which doesn't). I took the first few steps I described and got exactly the output I wanted except that I then realized the redundant lines issue. I want output like this next, except that anything but the names is optional - the names are essential and the rest can be sed'ed out or not.

Code:
0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Whoah</A> - Tom [ 5 <EM></EM> ]<BR>
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?34,0,800">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?36,0,820">Wow</A> - Dick [ 6 <EM></EM> ]<BR>
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?67,0,800">Shebang!!</A> - Dick [ 8 <EM></EM> ]<BR>
0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?35,0,820">Whee</A> - Harry [ 6 <EM></EM> ]<BR>
0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?66,0,800">Bop - bop</A> - Harry [ 7 <EM></EM> ]<BR>
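
One possible way out of the catch-22, sketched under assumptions: in the samples above the overlapping line carries the same thread number (37) in both files, so that number can serve as the key. awk's '!seen[key]++' idiom prints a line only the first time its key shows up and does not need sorted input, and grep -h leaves out the "0707.html:" prefix. That the thread number is unique per thread is an assumption; the scratch directory and variable names are only for illustration.

```shell
# Recreate the two sample files from this post in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/0706.html" <<'EOF'
06/07/03  00:01:01 <A HREF="thread.cgi?38,0,820">Whoops</A> - Joe Public [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?36,0,820">Wow</A> - Dick [ 6 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?35,0,820">Whee</A> - Harry [ 6 <EM></EM> ]<BR>
EOF
cat > "$workdir/0707.html" <<'EOF'
07/07/03  00:00:00 <A HREF="thread.cgi?41,0,800">Whoah</A> - Tom [ 5 <EM></EM> ]<BR>
07/07/03  00:00:00 <A HREF="thread.cgi?40,0,800">Shebang!!</A> - Dick [ 8 <EM></EM> ]<BR>
07/07/03  00:00:00 <A HREF="thread.cgi?39,0,800">Bop - bop</A> - Harry [ 7 <EM></EM> ]<BR>
06/07/03  00:01:01 <A HREF="thread.cgi?38,0,800">Whoops</A> - Joe Public [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?37,0,800">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
EOF

# grep -h : leave out the filename prefix
# awk     : split on ? and , so $2 is the thread number (assumed unique per
#           thread); print a line only the first time its number is seen
# sed     : drop everything through "</A> - " so the name comes first
names=$(grep -h '00:00:00' "$workdir"/0*.html |
    awk -F'[?,]' '!seen[$2]++' |
    sed 's/^.*<\/[Aa]> *- *//' |
    sort)

rm -rf "$workdir"
printf '%s\n' "$names"
```

Because the duplicate is dropped before the sed step, Tom's two different threads both survive while his repeated "Huh" line is kept only once.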
Another way to do it, maybe, would be a 'for...in..do' kind of thing where I process each individual file and remove any line from '0707.html' that doesn't include 07/07 and keeps only those that do and so on for 0708.html with 07/08 and the rest. That would strip the redundancies before I grepped the 00:00:00 lines and those I could simply sort. Only trouble is, I don't know how to do a 'for...in...do' type command.
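
A rough sketch of such a 'for...in...do' loop. It assumes the files are named MMDD.html and the dates inside read DD/MM/YY at the start of the line, which is how the samples above look (0706.html holds 06/07/03); if the real files run the other way round, swap the two variables. The scratch directory and the cut-down sample files are only for illustration.

```shell
# Cut-down copies of the sample files, in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/0706.html" <<'EOF'
06/07/03  00:01:01 <A HREF="thread.cgi?38,0,820">Whoops</A> - Joe Public [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
EOF
cat > "$workdir/0707.html" <<'EOF'
07/07/03  00:00:00 <A HREF="thread.cgi?41,0,800">Whoah</A> - Tom [ 5 <EM></EM> ]<BR>
06/07/03  00:00:00 <A HREF="thread.cgi?37,0,800">Huh</A> - Tom [ 5 <EM></EM> ]<BR>
EOF

kept=$(
    for f in "$workdir"/[0-9][0-9][0-9][0-9].html; do
        base=$(basename "$f" .html)   # e.g. 0706
        mm=${base%??}                 # first two digits: 07
        dd=${base#??}                 # last two digits:  06
        # keep only this file's own day, then only the 00:00:00 lines
        grep "^$dd/$mm/" "$f" | grep '00:00:00'
    done
)

rm -rf "$workdir"
printf '%s\n' "$kept"
```

The general shape is 'for NAME in LIST; do COMMANDS; done'; everything the loop prints can be piped or captured as one stream.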

If I'm understanding your first post right, that depends on all posts being by 'digiot' right? So that's where I didn't explain myself clearly - there's digiot and tom and dick and harry, so I can't end the sed replacement string with any specific name. That's why I was keying on generic html elements. And I don't really care if there's junk in there or not - it was more the task of sorting them by name - since there are a variable number of whitespaces before the name, I just figured I'd strip everything before the name and do a simple sort instead of on a column or field or whatever.

Thanks, too, for the clarification of sed syntax. (And the warning about October - I actually thought of that but the important part of October would be in the filename - the dates in the files wouldn't matter as much - and it's no problem to switch the command to '1.*>' on sed for that. As a matter of fact, I think there's a way to prevent grep from writing in the '0707.html:' part, anyway. But it's exactly that kind of detail that trips me up all the time.) Anyway - I was wondering about that '-e' myself. 'add the script to the commands to be executed' didn't make any sense to me, but didn't seem to hurt anything. And I get what the 's/string to replace/replacement/g' is mostly - the '*' means all characters between '0' and '>'? And the nothing between '//' means to replace it with nothing - in other words, turns string replacement into string deletion? But what's the dot or period '.' before the asterisk for? I would think it was to 'not treat * as literal' but that's not the way it usually works - it's more like '\*' to 'do treat the asterisk as literal'.
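
On the '.' question, the usual reading of sed's regular expressions is that '.' matches any single character and '*' means "zero or more of whatever came just before", so '.*' is "any run of characters", whereas a bare '*' after the '0' would only repeat the 0. That differs from shell wildcards, where '*' stands alone. A small sketch on the sample line from the first post:

```shell
line='0707.html:07/07/03  00:00:00 <A HREF="thread.cgi?63,0,800">Post Title</A> - digiot [ 6 <EM></EM> ]<BR>'

# 0.*A> : a 0, then any characters, then A>  (greedy: runs to the last A>)
with_dot=$(printf '%s\n' "$line" | sed -e 's/0.*A>//')

# 0*A>  : zero or more 0s immediately followed by A>; only the bare "A>"
#         inside </A> fits, so almost nothing is removed
without_dot=$(printf '%s\n' "$line" | sed -e 's/0*A>//')

printf '[%s]\n[%s]\n' "$with_dot" "$without_dot"
```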

Anyway - thanks again. (I appreciate your pipes, too - I would like to compact the command as much as I could, but I was more concerned with checking the results of each command, so was entering them one at a time.)
 
Old 07-31-2003, 07:59 AM   #5
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3594
The main "problem" is stripping the HTML representation because it unnecessarily clutters up your results and makes it (a little bit) harder to produce results. If you're not able to dump the URIs to text using something like "lynx -nolist -dump <URI> | grep (etc etc)" and work from there, you can use chars that you know are at similar positions in all files as field delimiters. In all examples the part before the first pipe should be you doing your grep stuff; remove the outer quotes before usage.

If you're able to use one non-standard utility to strip HTML, you can have this result:
Code:
Tom 5
Tom 5
Tom 5
Harry 7
Harry 6
Dick 8
Dick 6
using " | html2text -nobs | tr -s " " | awk '{print $5, $7}' | \sort -r -k1". The "tr" invocation turns multiple whitespaces into one, "awk" just prints the fields we now know are similar, and the backslash+-r for "sort" are only there because of me needing to escape my default sort alias.

If not able to strip HTML, you can still achieve the same result using " | tr -s " " | cut -d ">" -f 3- | cut -d "<" -f 1 | cut -d " " -f 3,5 | \sort -r -k1". Again, the "tr" invocation turns multiple whitespaces into one, the first "cut" uses the closing angle bracket ">" as field delimiter and outputs every field starting from the 3rd one, the second does the same with the opening angle bracket "<" and the 1st field, the third uses a space and the 3rd and 5th fields, and finally there is a sort.
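
To see what each stage of that pipeline leaves behind, here is a sketch that runs it one step at a time on a single sample line from earlier in the thread (the variable names are only for illustration):

```shell
line='0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Huh</A> - Tom [ 5 <EM></EM> ]<BR>'

# Squeeze runs of spaces into one so the field positions are predictable.
squeezed=$(printf '%s\n' "$line" | tr -s ' ')

# Fields split on ">": keep the 3rd onward, i.e. what follows </A>.
after_gt=$(printf '%s\n' "$squeezed" | cut -d '>' -f 3-)

# Fields split on "<": keep the 1st, i.e. what comes before <EM>.
before_lt=$(printf '%s\n' "$after_gt" | cut -d '<' -f 1)

# Fields split on a space: the 3rd is the name, the 5th the number.
result=$(printf '%s\n' "$before_lt" | cut -d ' ' -f 3,5)

printf '%s\n' "$result"
```

A name with a space in it, like "Joe Public", would shift the field numbers, which is part of why these count as "dumb" solutions.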

One other way, using a loop and w/o any external utilities except "sort", could look like this:
" | while read line; do line=( ${line} )
anchorfield0=5; anchorfield1=3
let printfield0=${#line[@]}-$anchorfield0
let printfield1=${#line[@]}-$anchorfield1
echo ${line["$printfield0"]} ${line["$printfield1"]}
done | \sort -r -k1". If you want to understand what's going on here, do a "set -x" before you start grepping and tack on "2>&1|tee output.log", then "less output.log" should get you started.

*You'll notice these are "dumb" solutions, because they just output fields w/o checking if a field has the desired range/value/length.
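
The loop above picks its fields by counting back from the end of each line. A sketch of just that arithmetic for one sample line, using plain-sh positional parameters in place of the bash array (a swapped-in equivalent, not the code above), with filename expansion switched off so the ? and [ in the line are not treated as wildcards:

```shell
line='0706.html:06/07/03  00:00:00 <A HREF="thread.cgi?37,0,820">Huh</A> - Tom [ 5 <EM></EM> ]<BR>'

set -f            # no filename expansion while splitting the line
set -- $line      # split on whitespace into $1, $2, ...
set +f

count=$#                  # number of words in this line
shift $((count - 5))      # now $1 is the 5th word from the end
name=$1                   # the name
number=$3                 # two words further on: 3rd from the end

printf '%s %s\n' "$name" "$number"
```

Counting from the end means a title with extra words in it does not move the name out of place.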
 
Old 07-31-2003, 09:18 AM   #6
slakmagik
Senior Member
 
Registered: Feb 2003
Distribution: Slackware
Posts: 4,113

Original Poster
Rep: Reputation: Disabled
Whoah. Thanks unSpawn. That'll take a while to digest. Those are 'dumb' solutions? I'm not sure I want to see the smart ones, then. *g* Nah, I'm just kidding. I see what you're saying about the error-checking. But this is way beyond me right now. I'm just winging it for a specific result and was just coming to post that I'd hit on the same thing - part of the html code was actually useful for differentiating lines but wasn't absolutely necessary and more of a pain than it was worth, so I did process the files as text and got it to work. I also stripped off those end numbers (though I had to strip [n] instead of n) and, for a finishing touch, 'uniq -ci' was cool, as it counted up the results. I was leaving that off because it was optional and I wanted to work on the first problem first. Also, in re-reading the manual - read, re-read, re-re-read - for uniq I realized I could have taken better advantage of the -s switch (skip chars), I think. At certain points I could uniq -s then sed 's/X/X/g' later.
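
A small sketch of the -s idea, on made-up lines with a 10-character "NNNN.html:" prefix: sort on what follows the colon so matching lines become neighbours, then let uniq skip the prefix when comparing.

```shell
# Made-up lines: a 10-character filename prefix, then the part to compare.
input='0706.html: - Tom
0707.html: - Dick
0707.html: - Tom'

# sort -t: -k2 : sort on what follows the first colon
# uniq -s 10   : skip the first 10 characters when comparing neighbours
deduped=$(printf '%s\n' "$input" | sort -t: -k2 | uniq -s 10)

printf '%s\n' "$deduped"
```

The prefix stays in the output, so it can still be stripped with sed afterwards if it is not wanted.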

I suspect that's a problem with me though. I feel overwhelmed by the complexity (of even simple stuff) sometimes and so I try to do things a simple step at a time. But then this results in excessive labor and falling into pitfalls I didn't see ahead of time - like the redundant lines I needed to remove. If I could visualize the *entire* problem and figure out the shortest solution to the whole deal, I think I'd be better off. But it's a different way of thinking.

I should post exactly what I did, I guess, but I doubt anyone else would have the same problem and what I did is too dumb to repeat, though it did (barely) work and is set up in a minimal way for later.

Thanks to you and Blinker_Fluid. I'm going to save and study your posts. If I can get it to a point where I can do this with a nice, concise shell script, I'll be in Linux heaven.

Oh - one last thing. That's one problem with the man pages - they're necessarily kind of abstract and isolated. Reading the tutorial I found (that has at least another part) or having threads like this, where you see how stuff is put together to accomplish a concrete objective... well, it's just a lot more effective for learning, to me. And with the man pages, you have to know which commands you need to begin with, in order to select the man page - though all the 'apropos' and what not can help with that.

Last edited by slakmagik; 07-31-2003 at 09:20 AM.
 
  

