LinuxQuestions.org
Old 01-15-2008, 11:03 AM   #1
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Rep: Reputation: 0
Need script to clean up file


I was sent a text file that is an extract from an old database. The problem is that when the file was created, it had (apparently) arbitrary carriage returns (CR's) in some of the lines. There are about 100k records in the file, so it's just too big to edit by hand.

For the most part, the CR is immediately before a pipe (|). What I need is a script to identify lines that start with a pipe (but there are pipes to segregate each data field in each record, so it can only be a pipe in the 1st position of the line), and then append the line that begins with the pipe to the end of the previous line.

I can't reasonably post an example of the data because the records are ridiculously long.

Anybody have any idea how I can tackle this? I have very little programming experience, but I do have several languages loaded on my machine.

Any help would be very much appreciated!
Stevemcb
 
Old 01-15-2008, 11:11 AM   #2
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
You could use:
Code:
tr -d '\r' < your_file
to get rid of dos-style CR's.
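A minimal sketch of how the command above might be wrapped so the result goes to a new file and the original stays untouched. The function names and file arguments are placeholders, not anything from the thread, and a POSIX shell with grep and tr is assumed.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print how many lines of the file ($1) contain a carriage return.
count_cr_lines() {
    cr=$(printf '\r')
    # grep exits with status 1 when the count is 0, so mask that status.
    grep -c "$cr" "$1" || true
}

# Write a CR-free copy of $1 to $2, leaving $1 untouched.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Possible usage (file names are assumptions):
#   count_cr_lines extract.txt
#   strip_cr extract.txt extract.nocr.txt
#   count_cr_lines extract.nocr.txt
```

Counting before and after gives a quick check that the file really contained CRs and that the copy no longer does.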
 
Old 01-15-2008, 12:13 PM   #3
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Need script to edit text file

That was AMAZING!~

Scared the hell out of me when the file started scrolling across the screen, but the result was awesome!

Is it possible to do the same thing if the 1st character is an alpha character?
 
Old 01-15-2008, 12:25 PM   #4
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
Quote:
Originally Posted by stevemcb View Post
Scared the hell out of me when the file started scrolling across the screen, but the result was awesome!
I suppose you should have just written the output straight to file. I hope you redirected it. (tr -d '\r' < fin > fout)
Quote:
Originally Posted by stevemcb View Post
Is it possible to do the same thing if the 1st character is an alpha character?
Not quite sure what you mean ...
 
Old 01-15-2008, 12:33 PM   #5
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Need script to clean up file

First off, I ran tr -d '\r' on a copy of the original file (I'm not totally crazy).

In my original post, I needed to identify lines that started with "|" and append that line to the previous line.

In reading tr --help, it looks like the code you gave me just deleted the 'returns' in the document - is that correct?

On a separate train of thought, I also noticed in the help that there is an "[:alpha:]" in tr, and I'm just wondering if that would go through the file and delete, say, "A" if it was the first character in the line?
 
Old 01-15-2008, 12:43 PM   #6
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
Quote:
Originally Posted by stevemcb View Post
In my original post, I needed to identify lines that started with "|" and append that line to the previous line.
I saw the original statement more as a problem of removing carriage returns from the file, rather than a pattern-matched removal.
Quote:
Originally Posted by stevemcb View Post
In reading tr --help, it looks like the code you gave me just deleted the 'returns' in the document - is that correct?
That is correct. It removes every carriage return character (\r) in the file. DOS/Windows files end each line with \r\n, for example, while classic Mac OS used a bare \r.
Quote:
Originally Posted by stevemcb View Post
On a separate train of thought, I also noticed in the help that there is an "[:alpha:]" in tr, and I'm just wondering if that would go through the file and delete, say, "A" if it was the first character in the line?
No, tr -d '[:alpha:]' would remove all letters from the file. `tr` works on individual characters and has no notion of where a character sits in a line.
If removing the carriage returns is not enough, and you need to do a pattern-matched removal, I'd look at sed/awk/...
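To make the difference concrete, here is a small sketch contrasting the two behaviours. The function names are made up for illustration; both read standard input and write standard output.

```shell
# Deletes every letter everywhere in the input (what tr would do).
delete_all_letters() {
    tr -d '[:alpha:]'
}

# Deletes a single letter only when it is the first character of a line.
# sed works line by line, so ^ can anchor the match to the line start.
delete_leading_letter() {
    sed 's/^[[:alpha:]]//'
}

# Possible usage:
#   printf 'Abc|1\n|Xy\n' | delete_all_letters
#   printf 'Abc|1\n|Xy\n' | delete_leading_letter
```

The sed version is the kind of pattern-matched removal meant above. Whether deleting that leading letter is actually what the data needs is a separate question.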
 
Old 01-15-2008, 12:50 PM   #7
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
Quote:
Originally Posted by stevemcb View Post
For the most part, the CR is immediately before a pipe (|). What I need is a script to identify lines that start with a pipe (but there are pipes to segregate each data field in each record, so it can only be a pipe in the 1st position of the line), and then append the line that begins with the pipe to the end of the previous line.
Can this be restated as: remove all carriage returns at the end of a line?
 
Old 01-15-2008, 01:10 PM   #8
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Script file (whatever)

Maybe! Since I'm working with a copy of the file, it doesn't matter if we try - and fail.
 
Old 01-15-2008, 01:25 PM   #9
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
Quote:
Originally Posted by stevemcb View Post
it doesn't matter if we try - and fail.
Haha. But it does waste both our time.

If it indeed is a case of removing the CR's, I don't know why the 'tr' solution isn't good enough. Aren't you trying to remove all CR's? Or just a selective few that are at the end of a line, before a pipe?

If this is a file that was moved from a DOS to *nix machine, even a:
Code:
dos2unix -n fin fout
would suffice (provided you have the 'dos2unix' utility; -n writes the converted text to fout instead of converting the files in place.)

I should also have asked: what OS was the file on (DOS/Windows?), and what is it on now (Linux?)?

Last edited by h/w; 01-15-2008 at 01:27 PM.
 
Old 01-15-2008, 01:42 PM   #10
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Need script file

h/w, I hope I'm not wasting your time.

The file is the output of a table that was on a home-grown application (not created or maintained by me). The application was on an Oracle 7.3 database running in Windows.

We're trying to move it to a new SQL database, but the file has date problems so we can't go direct from one database to another without erroring out constantly.

When a co-worker created the file, it appears that the file got carriage returns, line feeds, or something in it (I can't tell what, but I'm not an IT guru (in case you couldn't tell <grin>). Before I can begin to correct the date errors, I have to get past the unfortunate carriage returns or line feeds.

stevemcb
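One way to find out which of the two it is would be to count the bytes directly. This is only a sketch: the function name is invented, and it assumes tr and wc are available.

```shell
# Count carriage-return (CR) and line-feed (LF) bytes in the file ($1).
count_line_end_bytes() {
    # tr -cd keeps only the named character; wc -c then counts the bytes.
    # The trailing tr strips the padding some wc implementations add.
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    echo "CR=$cr LF=$lf"
}

# Possible usage (file name is an assumption):
#   count_line_end_bytes extract.txt
# To look at the raw bytes of the first few lines instead:
#   od -c extract.txt | head
```

If the two counts are equal, every line probably ends in CR+LF, as Windows tools write it. If CR is zero, the stray breaks are plain line feeds and deleting CRs will not join anything.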
 
Old 01-15-2008, 01:54 PM   #11
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
Quote:
Originally Posted by stevemcb View Post
h/w, I hope I'm not wasting your time.
Of course not.
Quote:
Originally Posted by stevemcb View Post
The file is the output of a table that was on a home-grown application (not created or maintained by me). The application was on an Oracle 7.3 database running in Windows.
So, the file came from a DOS system. I'll assume you're sitting on a *nix box.
If so, the 'tr' command should be what you're looking for then. '\r' is the DOS carriage return, and it'll delete them.
If you want to replace the CR with say, a newline, you could do a:
Code:
tr '\r' '\n' < fin > fout
Hth.

If this still doesn't do it for you, could you post a snippet of the data so we get a better idea of what's needed to be done?

Last edited by h/w; 01-15-2008 at 01:55 PM.
 
Old 01-15-2008, 02:06 PM   #12
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Sample data

Here's a sample:
2003123|A15690195|3|N|1994-03-15 00:00:00|OPS$LSANDERS|SOUTHERN|COMPANY|64A PERIMETER CENTER EAST||ATLANTA|GA|US|30346|||REPLACEMENT IS DOA
This starts a new line, but shouldn't - this should all be the same record/line|REPLACE DOA REPLACEMENT|1993-12-16 00:00:00|1993-12-21 00:00:00|1993-12-21 00:00:00|1994-01-11 00:00:00|1994-03-15 00:00:00|N||THIS DOES NOT APPEAR TO BE A DUPLICATE OF CLAIM A14659942. PRIOR CLAIM WAS SERVICED BY NW COMPUTER SUPPORT IN WA STATE. SERIAL #'S MUST HAVE BEEN TYPED IN INCORRECTLY.|A|||CDR-74|||||0||||||||||||00000194.0013.0005
This starts a new line where it should2001235|A15078491|3|N|1994-06-28 00:00:00|OPS$LSANDERS|NPPD||PO BOX 499||COLUMBUS|NE|US|68601|||DOA|REPALCED MONITOR|||||1994-03-15 00:00:00|N|Pending more than 60 days with no resolution; claim rejected|YELLOW STICKY ATTACHED TO CLAIM INDICATED THAT MONITOR WAS RETURNED ON MRA NUMBER #41801 ON 02/22/94.....SOP MRA POINTS TO CLAIM NUMBER #A15078478|R|||JC-1532VMA-2|||||||||||||||||00000194.0014.0005

All the lines that terminate where they should end in a format like this:00000194.0014.0005, but the numbers vary greatly and can include alpha characters.

I can't see anything that would indicate why the line was terminated. If you need a larger sample, let me know. Also, in case it's helpful, here's the header of the file:
SERVICE_CENTER_CODE|CLAIM_NUMBER|CLAIM_CODE|STOCK_MERCHANDISE|CHANGED_DATE|EMP_CHANGED_BY|FIRST_NAME|LAST_NAME|ADDRESS_LINE1|ADDRESS_LINE2|PLACE_NAME|STATE_CODE|COUNTRY_CODE|POSTAL_CODE|PHONE_NUMBER|EXTENDED_WARRANTY_NUMBER|COMPLAINT|SERVICE_EXPLANATION|PURCHASE_DATE|FAIL_DATE|DATE_REQUESTED|DATE_COMPLETED|DATE_RECEIVED|REFURBISH_ALLOWED|NEC_COMMENT|NEC_NOTES|CLAIM_STATUS|REORDER_PART|SERVICE_REPAIR_METHOD_CODE|MODEL_NUMBER|PRIOR_SERVICE_DATE|PRIOR_SERVICE_NUMBER|PROOF_OF_PURCHASE|POP_NOTES|SAP_WARRANTY_NUMBER|DATE_PAID|TECHNICIAN_NAME|SERIAL_NUMBER|MRA_NUMBER|CLAIM_TYPE|COMPANY_NAME|STATUS_REASON_CODE|INVOICE_NUMBER|SAP_REASON_CODE|RESOLUTION_CODE|REPAIR_SEVERITY_CODE|M_ROW$$
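Since the header fixes how many pipe-separated fields a complete record has, one possible approach is to keep gluing lines together until that count is reached. The sketch below is not something posted in the thread. It assumes the first line is the header, that no field legitimately contains a pipe, and that broken pieces can be joined with nothing in between. The function name is made up.

```shell
# Rebuild records by field count: read the expected number of fields from
# the header line, then append lines to a buffer until it has that many.
join_by_field_count() {
    awk -F'|' '
        { sub(/\r$/, "") }                       # drop a trailing CR, if any
        NR == 1 { want = NF; print; next }       # header defines the field count
        { buf = buf $0; n = split(buf, parts, "|") }
        n >= want { print buf; buf = "" }        # record complete: emit and reset
        END { if (buf != "") print buf }         # flush an incomplete last record
    ' "$@"
}

# Possible usage (file names are assumptions):
#   join_by_field_count extract.txt > extract.joined.txt
```

This also covers continuation lines that start with a letter rather than a pipe, because it never looks at the first character. It cannot detect a break inside the last field of a record: the first piece already has the full field count, so the remainder would be treated as the start of a new record.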
 
Old 01-15-2008, 03:29 PM   #13
h/w
Senior Member
 
Registered: Mar 2003
Location: New York, NY
Distribution: Debian Testing
Posts: 1,286

Rep: Reputation: 45
Will this do:
Code:
 awk 'NR>1 && /^[|]/ {buf = buf $0; next} NR>1 {print buf} {buf = $0} END {if (NR) print buf}' < inputfile > outputfile
It'll be nice to see some other ways around this as well, from the other members ...

Last edited by h/w; 01-15-2008 at 03:34 PM.
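For anyone reading along, the idea of appending every line that begins with a pipe to the line before it can be spelled out as a small buffered awk program. This is a sketch, with a made-up function name. It holds one line back and only prints it once the following line turns out not to be a continuation, so several pipe-led lines in a row are all joined and no line is dropped.

```shell
# Join any line that starts with "|" onto the end of the previous line.
join_pipe_lines() {
    awk '
        { sub(/\r$/, "") }                        # drop a trailing CR, if any
        NR > 1 && /^[|]/ { buf = buf $0; next }   # continuation: glue onto held line
        NR > 1 { print buf }                      # held line is complete, emit it
        { buf = $0 }                              # hold the current line
        END { if (NR > 0) print buf }             # flush the last held line
    ' "$@"
}

# Possible usage (file names are assumptions):
#   join_pipe_lines inputfile > outputfile
```

Reading the file in fixed pairs with getline is the tempting alternative, but it misses a continuation whenever the broken line falls on the wrong side of a pair, which is why a one-line buffer is used here.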
 
Old 01-15-2008, 03:40 PM   #14
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Need script

Thanks, h/w, I will give this a try at my first opportunity. I just got on a 2 hour conference call.
 
Old 01-16-2008, 11:13 AM   #15
stevemcb
LQ Newbie
 
Registered: Apr 2006
Distribution: Suse 10.0
Posts: 14

Original Poster
Rep: Reputation: 0
Script for text file

I didn't get a chance to try it last night, but when I ran it this morning, there are still issues (lines that start with a pipe).

I took an extract of the header line, and 64 lines from the original file - some good records and some that continue to exhibit the problem (plus at least one where the line starts with an alpha character, which is the issue I brought up later in the thread).

I can put the file on a server and point you at it if you think it would help.

Thanks again for your help so far.
Stevemcb
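A quick way to see which lines are still broken after any clean-up attempt is to compare each line's field count with the header's. The following is a diagnostic sketch only. The function name is hypothetical, and it assumes the first line is the header and that pipes never occur inside a field.

```shell
# Report every line whose number of pipe-separated fields differs from
# the header line. Prints nothing if all lines match.
report_bad_lines() {
    awk -F'|' '
        { sub(/\r$/, "") }                  # ignore a trailing CR, if any
        NR == 1 { want = NF; next }         # header defines the expected count
        NF != want { printf "line %d: %d fields (expected %d)\n", NR, NF, want }
    ' "$@"
}

# Possible usage (file name is an assumption):
#   report_bad_lines outputfile | head
```

Lines reported with too few fields are fragments of a split record, whether they begin with a pipe or with a letter. A line with too many fields would suggest a pipe inside the data itself.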
 
  


All times are GMT -5. The time now is 01:55 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration