Advice on how to structure a cut command when dealing with very old files?

Mark_S · 09-16-2015, 04:16 PM

It took me a while to get a Linux system up and running (I've been very busy), but one of the things I want to get the hang of is file processing. To practice, I dug out a bunch of old floppy disks with message threads from my old CompuServe days (this took a while too). They are flat files, and I can easily read them using cat or more. My goal is to go through each file and generate a list of users, subjects, times and such. I thought this would be a good way to practice.

So far I've run into a problem with the cut command: it can't seem to handle the variable field lengths and, more importantly, the delimiters. Or I don't know how to structure the command so that it will. My main hang-up is that I can't define a delimiter longer than one character. The file reads like this:

#: 0 S0/Forum Announcement
09-Aug-94 09:35:29
Sb: Announcement
Fm: System
To:

Comics/Animation Forum, V. 3B(73)

Hello, Mark S. Ogilvie
Last visit: 08-Aug-94 10:57:23

Forum messages: 592768 to 656114
Last message you've read: 592768

Section(s) Selected: All

Number of Members in Conference: None

Forum !

ˇ

#: 0 S0/Forum Announcement
14-Aug-94 08:03:03
Sb: Announcement
Fm: System
To:

ˇ

#: 0 S0/Forum Announcement
14-Aug-94 08:44:10
Sb: Announcement
Fm: System
To:

Comics/Animation Forum+, V. 3B(73)

Hello, Mark S. Ogilvie
Last visit: 14-Aug-94 08:03:11

Forum messages: 638497 to 661075
Last message you've read: 638497

Section(s) Selected: All

Number of Members in Conference: None

News Flash:

Updated August 12.

We are soon going to be getting new Forum software that will allow us to open
more sections. Users of certain communications software need to make sure they
have a version that will handle this.

All versions of WinCIM, MacCIM, NavCIS and ASCII programs (ProComm, OzCIS,
etc.) will access the new areas, automatically. Programs which need updating
to access the new areas are:
DOSCIM - You need version 2.2, or later (GO CIMSOFT). If you don't wish to
upgrade, you must enter the forum in Terminal Emulation mode (GO ASCII) to see
the sections above 17.
Mac Navigator - you need version 3.2.1, or later (GO NAVIGATOR).
TAPCIS - you need version 5.42, or later (GO TAPCIS).
AutoSIG - you need version 7, or later (GO IBMCOM).

Post a message to SYSOP if you need help.

--------------------------------------------------------------------------
Japanimation CONference every Sunday at 9 pm Eastern time.
General CONference every Wednesday at 9 pm EST.
BREAKING IN CONference with Rob Davis, second Thursday of each month.
WITSIG party every Saturday at 10pm Eastern in CON room 17, open to all.
--------------------------------------------------------------------------
For biographies of most of the industry professionals that hang out here, read
the files PROBIO.TXT (detailed) or PROSYS.TXT (brief) in LIB 1.

We love to get graphics files, but PLEASE remember that you must have the
right to upload the graphic! Pictures scanned from books, videos or magazines
can NOT be uploaded; that's a violation of copyright.

Please do not repeatedly attempt to page or chat with members that you see in
the Forum. Many of them use auto-navigators or are unable to respond to
real-time chat. Attend our weekly informal conference on Wednesday, or post a
message - it is much easier, and gets you a better reply.

We also ask our members to use their real names, first and last.

Forum !

ˇ

#: 0 S0/Forum Announcement
14-Aug-94 08:52:36
Sb: Announcement
Fm: System
To:

Comics/Animation Forum+, V. 3B(73)

Hello, Mark S. Ogilvie
Last visit: 14-Aug-94 08:46:20

Forum messages: 638497 to 661075
Last message you've read: 661075

Section(s) Selected: All

Number of Members in Conference: None

Forum !

#: 658561 S1/General
11-Aug-94 21:25:50
Sb: #Lois n Clark show's dumb
Fm: Phil Adams 72470,1156
To: David Munier 73160,1670 (X)

Actually, I think realistic dialogue is a helluvalot more entertaining
than most of what passes for dialogue in entertainment today.

Phil Adams
Promethean Studios

There is 1 Reply.

#: 658836 S1/General
12-Aug-94 00:51:20
Sb: #658561-#Lois n Clark show's dumb
Fm: David Munier 73160,1670
To: Phil Adams 72470,1156 (X)

True. I was actually thinking of some real dialogue that doesn't go much
beyond:

"Hi"
"Hey"
<Grunt>

I was a bit tired when I wrote that remark.

Well-written dialogue is always entertaining. But that seems to be a
redundant statement.

-David Munier

There is 1 Reply.

Am I expecting too much from the cut command? If I could define the delimiter I could separate out lines like Sb: Fm: and such, but I can't figure out a way to do that. Am I using the wrong command?

danielbmartin · 09-16-2015, 07:18 PM

Quote:

Originally Posted by Mark_S

Am I expecting too much from the cut command?

awk is a better choice.

You provided a sample of the input file. It would be helpful if you also provided a corresponding output file. That would help the readers to better understand the problem, and also to test any code we might write.
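Until the expected output is posted, here is a minimal sketch of why awk fits the delimiter problem. It assumes the tag and its value are always separated by a colon followed by one space, as in the sample; the temporary file and the `|` in the output are only for illustration.

```shell
#!/bin/sh
# Hypothetical sample: two header lines in the format from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Sb: #Lois n Clark show's dumb
Fm: Phil Adams 72470,1156
EOF

# -F': ' sets a two-character field separator (colon, then space),
# which cut -d cannot do. $1 is the tag, $2 is the text after it.
fields=$(awk -F': ' '{ print $1 "|" $2 }' "$sample")
printf '%s\n' "$fields"
rm -f "$sample"
```

awk treats a separator longer than one character as a regular expression, so the same option also accepts patterns such as `-F':[ ]*'` if the spacing turns out to vary.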

Daniel B. Martin

Mark_S · 09-16-2015, 07:30 PM

Quote:

Originally Posted by danielbmartin

awk is a better choice.

You provided a sample of the input file. It would be helpful if you also provided a corresponding output file. That would help the readers to better understand the problem, and also to test any code we might write.

Daniel B. Martin

I'm a little embarrassed to say that I didn't think of putting up the output file. I'll put it out tomorrow after work.

HMW · 09-17-2015, 03:53 AM

I'm sure the awk ninjas will come along shortly and do their thing. But you could also use a loop to read the file.
Something like this:

Code:

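#!/bin/bash

# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
    # Only handle lines that begin with 'Fm' (the sender header).
    if [[ $line == Fm* ]]; then
        # Print the second and third fields (first and last name).
        echo "$line" | awk '{ print $2 " " $3 }'
    fi
done < Mark_S.txt

exit 0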

From your infile (here known as 'Mark_S.txt'), I get this result (extracting only the lines beginning with 'Fm' and then printing first and, if there is one, last names):

Code:

$ ./read_Mark_S.sh 
System 
System 
System 
System 
Phil Adams
David Munier

You can of course expand this in any number of ways.
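One possible expansion, as a rough sketch only: a single awk pass that prints date, sender and subject for each message. It assumes every message starts with a `#: ` line followed directly by the date line, and that no message body line begins with `Sb: ` or `Fm: `; the sample file below is a cut-down copy of the one from the first post.

```shell
#!/bin/sh
# Hypothetical cut-down sample in the format from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#: 658561 S1/General
11-Aug-94 21:25:50
Sb: #Lois n Clark show's dumb
Fm: Phil Adams 72470,1156
To: David Munier 73160,1670 (X)

Actually, I think realistic dialogue is a helluvalot more entertaining

#: 658836 S1/General
12-Aug-94 00:51:20
Sb: #658561-#Lois n Clark show's dumb
Fm: David Munier 73160,1670
To: Phil Adams 72470,1156 (X)
EOF

summary=$(awk '
    /^#: /    { want_date = 1; next }             # header line: the date follows
    want_date { date = $0; want_date = 0; next }  # remember the date line
    /^Sb: /   { subject = substr($0, 5) }         # text after "Sb: "
    /^Fm: /   { print date " | " substr($0, 5) " | " subject }
' "$sample")
printf '%s\n' "$summary"
rm -f "$sample"
```

The `Fm:` line triggers the print because it comes after `Sb:` in every message of the sample; if the order differs in other files, the print would need to move to the `To:` line or to the next `#:` line.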

Best regards,
HMW

danielbmartin · 09-17-2015, 07:22 AM

With the InFile as given in Post #1, this code ...

Code:

grep "^Sb\|^Fm" $InFile      \
|tr -cd '\11\12\15\40-\176'  \
|paste -sd" \n"              \
>$OutFile

... produced this OutFile ...

Code:

Sb: Announcement Fm: System
Sb: Announcement Fm: System
Sb: Announcement Fm: System
Sb: Announcement Fm: System
Sb: #Lois n Clark show's dumb Fm: Phil Adams 72470,1156
Sb: #658561-#Lois n Clark show's dumb Fm: David Munier 73160,1670

Explanation:
grep "^Sb\|^Fm" $InFile reads InFile, keeps lines starting with Sb or Fm.
tr -cd '\11\12\15\40-\176' gets rid of "garbage" characters.
paste -sd" \n" combines matching Sb and Fm lines.
>$OutFile writes OutFile.
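
To make the tr and paste steps less mysterious, here is a small stand-alone sketch. The input lines are made up, and the \377 byte is only an assumed stand-in for whatever non-printing bytes are in the real files.

```shell
#!/bin/sh
# tr -cd deletes every byte NOT in the set: \11 tab, \12 newline,
# \15 carriage return, \40-\176 printable ASCII. LC_ALL=C keeps tr byte-oriented.
cleaned=$(printf 'Fm: Sys\377tem\n' | LC_ALL=C tr -cd '\11\12\15\40-\176')
printf '%s\n' "$cleaned"

# paste -s joins all lines into one, cycling through the delimiter list:
# a space after the 1st line, a newline after the 2nd, a space after the 3rd...
pairs=$(printf '%s\n' 'Sb: one' 'Fm: alice' 'Sb: two' 'Fm: bob' | paste -s -d ' \n' -)
printf '%s\n' "$pairs"
```

Because paste only alternates delimiters, the pairing depends on every Sb line being followed by exactly one Fm line, which holds for the sample in Post #1.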

Daniel B. Martin

grail · 09-17-2015, 07:32 AM

Please place code or data in [code][/code] tags to maintain formatting.

As you didn't really explain how you wanted to use cut, my initial feedback would be to simply use grep based on your last input:

Code:

grep -E '^\s*(Fm|Sb):' file

This will return the required lines, but not necessarily the data you wanted specifically.
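
If you then want only the text after the tag, cut can do that part with its one-character delimiter. This is just a sketch on a made-up copy of one message, and it assumes a single space after the colon:

```shell
#!/bin/sh
# Hypothetical sample: one message header in the format from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#: 658561 S1/General
11-Aug-94 21:25:50
Sb: #Lois n Clark show's dumb
Fm: Phil Adams 72470,1156
To: David Munier 73160,1670 (X)
EOF

# grep keeps the Sb:/Fm: lines; cut -d: -f2- keeps everything after the
# first colon (so colons inside a subject survive); sed drops the leading space.
values=$(grep -E '^(Fm|Sb):' "$sample" | cut -d: -f2- | sed 's/^ //')
printf '%s\n' "$values"
rm -f "$sample"
```

Filtering with grep first matters here, because the date lines also contain colons and cut alone would split them too.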

NevemTeve · 09-17-2015, 09:25 AM

(If your files are older than five years, you can use the -d option of cut(1))

Mark_S · 09-17-2015, 10:51 AM

This explains a lot. I was trying to do this with only one line:
cut -f1-4 -d:COMICS1.MSG > comic_test1

I'll try your suggestions tonight and let you know how it comes out. Thanks all.