LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Pulling text from a text file and append to CSV (https://www.linuxquestions.org/questions/linux-newbie-8/pulling-text-from-a-text-file-and-append-to-csv-932114/)

fishy 03-01-2012 04:37 AM

Pulling text from a text file and append to CSV
 
Hi All,

I don't post here often, but scripting really isn't my thing but I have had an idea that may well save a lot of time !

We have a lot of staff who complete weekly timesheets on paper, and they are all returned here to HQ to process. One of the workers came up with a great idea - send in your timesheet details via SMS on a daily basis. We had a quick chat about formats and I have enough information to get started and prove the concept.

I have installed and configured SMS Server Tools3 on my SIM enabled netbook and have tested it's ability to receive SMS messages in a specific directory.

What I would like to do is create a script that will check the specified directory regularly, and from each file it finds, extract the information into a CSV file. The incoming format is like this:

Code:

From: 447123456789
Sent: 12-03-01 09:07:55
Received: 12-03-01 09:08:01
Subject: GSM1
Modem: GSM1
Report: no
Alphabet: ISO
Length: 25

1130:010312:SW:0930:1345

The last line is the bit that creates the timesheet, the colon being the separator in this example, I can use anything. The format in this example is:

Code:

EMPLOYEE_NO:DATE:CLIENT:START_TIME:ENDTIME
I would want the From, Sent, Received fields extracted as well.

So could someone point in the right direction with the tools I would need for a shell script? Ideally this would also create new CSV's every monday for the next seven days.

I am familiar with the linux shell, just not had to do any scripting in the past!

Any advice greatly appreciated!

Dark_Helmet 03-01-2012 05:55 AM

Let me first say this: what you describe is probably best suited for a scripting language like python or perl for readability. I may come back and edit this post or add a response with a python script.

That said, I took this as an opportunity to teach myself a little more about using sed. The command is long, but I managed to have sed do everything you asked (I think) when it comes to saving and formatting the data.

Here's the command:
Code:

sed -srn '/^[0-9]+/ { s@:@,@g ; H } ; /^From:/ { s@From: (.*)$@\1,@ ; H } ; /^Sent:/ { s@Sent: (.*)$@\1,@ ; H } ; /^Received:/ { s@Received: (.*)$@\1,@ ; H } ; $ { g ; s@\n@@g ; p }' filename.txt
Given the sample data you provided (and saved in a file named testdata.txt:
Code:

user@localhost$ sed -srn '/^[0-9]+/ { s@:@,@g ; H } ; /^From:/ { s@From: (.*)$@\1,@ ; H } ; /^Sent:/ { s@Sent: (.*)$@\1,@ ; H } ; /^Received:/ { s@Received: (.*)$@\1,@ ; H } ; $ { g ; s@\n@@g ; p }' testdata.txt
447123456789,12-03-01 09:07:55,12-03-01 09:08:01,1130,010312,SW,0930,1345

So, the first field is the "From," second field is the "Sent," third field is "Received," and the remaining fields match one-to-one the data in the last line with the colons replaced by commas.

If you do not want the colons replaced by commas, delete the part of the command in blue.

The idea is, you run the command on each file and redirect the output (with >>) to some central CSV file. For instance (I made some sample copies of your data file, modified them a bit, and left them all with a ".txt" extension):
Code:

user@localhost$ sed -srn '/^[0-9]+/ { s@:@,@g ; H } ; /^From:/ { s@From: (.*)$@\1,@ ; H } ; /^Sent:/ { s@Sent: (.*)$@\1,@ ; H } ; /^Received:/ { s@Received: (.*)$@\1,@ ; H } ; $ { g ; s@\n@@g ; p }' *.txt
447123456789,12-03-01 09:07:55,12-03-01 09:08:01,1130,010312,SW,0930,1345
447123456789,01-12-11 09:07:55,01-12-11 09:09:22,1130,010312,SW,0930,1345
447987654321,12-03-01 09:07:55,12-03-01 09:08:01,1130,010312,NE,1212,1345

You'd probably want to alias that command (if you choose to use it), because typing that sucker every time would be... error prone.

Like I said, I might come along and post a python script a little later. Right now: sleep.

fishy 03-01-2012 06:36 AM

Wow!

Thank you so much - that will definitely get me going :-)

Have a great sleep!

catkin 03-01-2012 06:56 AM

Here's a pure bash solution for the parsing and output
Code:

#!/bin/bash

while read -r line
do
    case $line in
        'From:'* )
            from=${line#From: }
            ;; 
        'Sent:'* )
            sent=${line#Sent: }
            ;; 
        'Received:'* )
            received=${line#Received: }
            ;; 
        'Subject:'* | 'Modem:'* | 'Report:'* | 'Alphabet:'* | 'Length:'* | '' )
            ;; 
        * )
            array=( ${line//:/ } )
            employee_no=${array[0]}
            date=${array[1]}
            client=${array[2]}
            start_time=${array[3]}
            endtime=${array[4]}
    esac
done < input.txt

echo "'$from','$sent','$received','$employee_no','$date','$client','$start_time','$endtime'"


Dark_Helmet 03-01-2012 02:07 PM

Here's a modified form of the command... Still does the same thing. It just does it in fewer characters :)

Code:

sed -srn '/^[0-9]+/ { s@:@,@g ; H } ; /^(From|Sent|Received):/ { s@[^:]*:[[:space:]]*(.*)$@\1,@ ; H } ; $ { g ; s@\n@@g ; p }' *.txt
This one might be slightly easier to understand in that, it's clear from the "(From|Sent|Received)" used that sed is concerned with those lines :)

Just as an FYI, the above command triggers on four kinds of lines in the file. Lines that begin with:
1. From:
2. Sent:
3. Received:
4. <one or more digits>

The From, Sent and Received lines can have any number or type of space after the colon. Anything after the space(s) will be printed at the end. The line that starts with one or more digits will be printed regardless.

For the python script... it's more or less similar to catkin's shell script. This python script would be invoked as:
Code:

/path/to/script.py /path/to/sms_files/*.txt
The script:
Code:

#!/usr/bin/python

import fileinput
import re

lastFilename = ""
outputLine = ""

for inputLine in fileinput.input():
    if( fileinput.filename() != lastFilename ):
        lastFilename = fileinput.filename()
        if( outputLine != "" ):
            print( outputLine[1:]  )
            outputLine = ""
    if( inputLine[:5] == "Sent:" or inputLine[:5] == "From:" ):
        outputLine += "," + inputLine[6:].rstrip()
    elif( inputLine[:9] == "Received:" ):
        outputLine += "," + inputLine[10:].rstrip()
    else:
        reMatches = re.match( r'[0-9]+', inputLine )
        if( reMatches != None ):
            outputLine += "," + ','.join( inputLine.split( ':' ) ).rstrip()

print( outputLine[1:] )

As for the rest of your questions...

Quote:

What I would like to do is create a script that will check the specified directory regularly, and from each file it finds, extract the information into a CSV file.
"check ... regularly" = cron job
You just need to determine how often. Cron can schedule a task once per minute all the way up to once per year.

"extract the information to a CSV" = use the command (or one of the scripts provided) and use redirection (e.g. "sed -srn <pattern> /path/to/specific/dir/*.txt >> /path/to/dir/with/file.csv")

Quote:

Ideally this would also create new CSV's every monday for the next seven days
Cron can also run a job on a specific day of the week (e.g. "Monday," Tuesday," etc.). All you would need to do for this task would be to mv the csv file to another name and then touch a new csv file with the same name as previously (so the commands/scripts can refer to the same generic name). For example:
Code:

mv /path/to/dir/with/file.csv /path/to/archive/dir/file_$(date +%Y%m%d).csv ; touch /path/to/dir/with/file.csv

fishy 03-02-2012 07:32 AM

Hats off to you chaps! I had already sorted uploading the csv to Google apps spreadsheets and manipulating in the middle, all the info here will now fill the gaps for me - and more to the point will get me started in scripting which is what I really want to do to develop other items here.

Really, really appreciate your contributions - extra kudos to you both for donating some of your time to help a charity! :-)

I'll post back here what I end up with!

Regards

Ifan


All times are GMT -5. The time now is 07:57 PM.