FTP script to manage unprocessed files

suneeladdala · 12-02-2007, 08:02 PM

Hi Friends!

I would be very thankful if you can help me out.
Here is the scenario. I have an ftp script which runs every day and brings sysdate-1 files from the ftp server. We run this script on a daily basis and archive the files after they are processed into the data warehouse for each day.
Now I need a script which looks at the archive folder and finds the latest processed file date (say it found 11/27/2007 in the archive folder), then pulls the files after that date up to yesterday from the ftp server.
That is my requirement. Do you guys have any sample script which does this check?
Could you please send it to me if you have one?

Thanks a lot for your time.

Dr_P_Ross · 12-03-2007, 05:35 AM

I would suggest using perl for this, and in particular the nice Date::Calc library. Here is a simple script, not as compact as a typical perl hacker might like but spelled out so that you can see what's going on. The Date_to_Days function in Date::Calc returns the day number of a date, counting from Jan 1, 1 A.D. as day 1, so comparing two dates becomes a plain numeric comparison and the decision about which files to fetch is easy. You would need the ncftp package for this; it contains the useful commands ncftpls and ncftpget.

Code:

#!/usr/bin/perl -w
#
# Usage: getRecent username password localDir

use Date::Calc qw( Decode_Date_EU Today This_Year Date_to_Days Decode_Month);

my ( $user, $pass, $localDir, $remoteFTPsite );
my ( $latestLocalDate, $latestDateAsNum, $todayAsNum );
my ( $day, $month, $year, $fileName, $fileDateAsNum, $wantedFiles, $nWanted );
my @field;

# grab some vital parameters from the command line:

$user     = shift || die "must specify user password and local dir\n";
$pass     = shift || die "must specify password and local dir\n";
$localDir = shift || die "must specify local dir\n";

# where the remote stuff can be found:

$remoteFTPsite = "ftp://remote-data-store.com/reports/";

# get the date of the most recent file in $localDir. The date will
# be expressed as DD/MM/YYYY, see 'man ls' and 'man date'. We grab
# the second line (after 'total' stuff) and its sixth field:

$latestLocalDate = `/bin/ls -lt --time-style=+'%d/%m/%Y' $localDir | gawk 'NR==2{print \$6}'`;
chomp $latestLocalDate;
$latestDateAsNum = Date_to_Days(Decode_Date_EU($latestLocalDate));

# get today as a number, and this year: the remote listing shows no
# year for recent files, so we assume the current one (check this for
# your application, e.g. December files fetched in January):

$todayAsNum = Date_to_Days(Today());
$year       = This_Year();

# fetch a listing of the remote files. Lines usually look like this:
#        -rw-r--r--  1 fred users   9327  Oct 17 14:45 CustomerReport0977.txt
# field:    0        1   2    3      4     5   6   7     8
# Note that's a minus-el not a minus-one in the command below:

open(REMOTE_LS, "/usr/bin/ncftpls -u $user -p $pass -E -l $remoteFTPsite |")
  or die "cannot run ncftpls: $!\n";
$wantedFiles = "";
$nWanted = 0;
while(<REMOTE_LS>) {
  chomp;             # clean up the line
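  @field    = split; # split line into fields
  next unless @field >= 9;   # skip 'total' and other non-file lines
  $month    = Decode_Month($field[5]);
  next unless $month;        # Decode_Month gives 0 if this is not a month name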
  $day      = $field[6];
  $fileName = $field[8];
  $fileDateAsNum = Date_to_Days($year, $month, $day);
  print "$day/$month/$year $fileName $fileDateAsNum\n";
  if($fileDateAsNum > $latestDateAsNum && $fileDateAsNum < $todayAsNum) {
    # ok, we want this file:
    $wantedFiles = "$wantedFiles $fileName";
    $nWanted++;
    printf("... %02d/%02d/%d %s\n", $day, $month, $year, $fileName);
  }
}
close(REMOTE_LS);

# Once you are happy, change print to system in the line below:

if($nWanted > 0) {
  print "ncftpget -u $user -p $pass -E $remoteFTPsite $localDir $wantedFiles\n";
} else {
  print "no files to fetch\n";
}

cacycleworks · 12-03-2007, 04:49 PM

Also keep in mind that you don't need to stay all perl... I often use bash scripts as wrappers. Let me share some of what I do as an example case:

Code:

14:59 ~/Catalog$ cat catalog.sh
#!/bin/bash
umask 007
EX_IOERR=74   # exit code for an input/output error, as in sysexits.h
#
# catalog.sh -- a script to call legacy catalog compile  scripts.
#
echo "$0: catalog compile and update script"

echo ""
echo "Running compile_catalog.pl to build html files"
./compile_catalog.pl
if [ $? -ne 0 ];  then
        echo "catalog compile failed; edit source file or catalog.pl"
        exit $EX_IOERR
fi
echo "   ... done."

echo ""
echo "Running lftp to update website catalog using commands from ftp_script.scp"
lftp -f ftp_script.scp
if [ $? -ne 0 ] ; then
        echo "lftp failed"
        exit $EX_IOERR
else
        echo "   ... done."
        echo ""
        echo "website catalog update is complete"
fi

14:59 ~/Catalog$ cat ftp_script.scp
debug 0
set cmd:parallel 20
set dns:cache-enable
set net:connection-limit 20
open <host>
user <username pass>
CD catalog
LCD /home/www/catalog
MPUT *.html
CD some_dir
LCD /home/www/catalog/some_dir
MPUT *.html
CD ../some_dir2
LCD /home/www/catalog/some_dir2
MPUT *.html
CLOSE
exit

lftp is part of the standard Ubuntu install and allows parallel connections, which really speeds up bulk uploads.

Additionally, you can combine the find command with the above .sh and .pl scripts to really get it going.

Here's an example of where I use find in an .sh script to delete intermediate compiled html as well as source file backup files. In this example, the backups were named file_name.<numeric_date_code>

Code:

15:05 ~$ cat clear_html_and_backups.sh
#!/bin/sh
echo "deleting html files..."
NUM=`find . -type f -name '*html' -perm 644 | wc -l`
find . -type f -name '*html' -perm 644  -exec rm {} \;
echo "html files deleted: $NUM"
echo ""
echo ""
echo "db_0.txt backup files deleted:"
find www/ -type f -name 'source_file.txt.*' -print  -exec rm {} \;

Obviously, you can test the find commands with cp rather than rm... I always make about 3 tests and a backup before rm! See how I set the NUM variable? That's a real good test. Find has real useful time testing, absolute and relative -- and can be compared to a given file's date. Find allows for some amazing single line entries in crontabs. I did a cron's find once that must have been 200 characters long. :P

BTW, I used a perl script to read source_file.txt to generate HTML pages which the top scripts copy to local "current" folder and then ftp up to site. All of this is currently being replaced with php + mySql.

Anyhow, the point of my post is that I tend to combine methods to form the shortest solution to code. Sometimes a shell command is the best (.sh, ftp, and find) or sometimes perl (to handle more complex text parsing).

Chris

suneeladdala · 12-03-2007, 06:44 PM

Hi Ross,

Thanks for your reply. It really helps a lot.
I'm not good at perl, so I did not understand a few lines.
Could you please explain what these mean?

1. open(REMOTE_LS, "/usr/bin/ncftpls -u $user -p $pass -E -l $remoteFTPsite |");

What are -E and -l after $pass in the above line, and what does the pipe symbol | before the closing quote indicate?

2. @field = split; # split line into fields

What are we telling perl here?

3. printf("... %02d/%02d/%d %s\n", $day, $month, $year, $fileName);

What does "... %02d/%02d/%d %s\n" mean here?

Also, it looks like you are finding the file date based on the time stamp, right? Shouldn't it come from the file name if the name itself contains the date, like abc_09282007.txt?
Actually I have a Solaris OS. Does this work there, or do you know where I need to modify it?

Thanks a lot for your help.
Really appreciate it.

Dr_P_Ross · 12-04-2007, 11:15 AM

What are -E and -l after $pass in the above line, and what does the pipe symbol | before the closing quote indicate?

The -E tells ncftpls to use an active connection -- see http://www.slacksite.com/other/ftp.html for an explanation. This works better for some data providers, but you may not need it. The -l instructs ncftpls to request a detailed listing, like "ls -l"; without it you just get a list of files but without date information.
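
As for the pipe symbol: a trailing | inside the string given to open tells perl to run that string as a command and to connect the command's output to the filehandle, so the script can read the listing line by line as if it were a file. Here is a minimal sketch of the same idea, with echo as a harmless stand-in for ncftpls; the helper name read_command_lines is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a command and return its output lines. The trailing "|" makes
# open start the command and attach its standard output to $fh.
# (read_command_lines is a hypothetical helper, not part of the script above.)
sub read_command_lines {
    my ($command) = @_;
    open(my $fh, "$command |") or die "cannot run '$command': $!\n";
    my @lines = <$fh>;   # read everything the command printed
    close($fh);
    chomp @lines;        # strip the newline from each line
    return @lines;
}

# echo stands in for ncftpls here
print "got: $_\n" for read_command_lines("echo first; echo second");
```

With the | at the start of the string instead, perl would write to the command rather than read from it.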

@field = split;

Here, split is a function called without arguments. One of the things that people find hard to get used to about perl is that if you don't supply an argument, many functions fall back on a default, which in this case is the special variable $_ holding the current line. The while(<REMOTE_LS>) loop reads one line at a time into $_. Then split chops it into separate fields, using the default assumption that fields are delimited by white space (any number of spaces and/or tabs). @field is an array; after the split, the fields on the line can be found in $field[0], $field[1], $field[2] etc. See the comment in the perl script just above that point, which gives an example line and shows how it would be chopped into separate fields.
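
To see it in action, here is a small sketch that applies a bare split to the example listing line from the script's comment. The helper name listing_fields is invented for the illustration, and it assumes file names contain no spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one listing line on white space, the same way the bare
# "split" in the script does. (listing_fields is a hypothetical helper.)
sub listing_fields {
    local $_ = shift;   # put the line into the default variable $_
    return split;       # same as split(' ', $_)
}

my @field = listing_fields(
    "-rw-r--r--  1 fred users   9327  Oct 17 14:45 CustomerReport0977.txt");

print "month=$field[5] day=$field[6] name=$field[8]\n";
# prints: month=Oct day=17 name=CustomerReport0977.txt
```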

What does "... %02d/%02d/%d %s\n" mean here?

This is a formatting specification. Each % marks the start of an instruction about how to present the next argument in the sequence of arguments given to printf. The details can be found by looking at one of several man pages; try the command man printf, or perldoc -f sprintf for perl's own version.
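
For that particular format: %02d prints an integer padded with leading zeros to at least two digits, %d prints an integer as it is, %s prints a string, and everything else (the dots, the slashes, the space) is copied through unchanged. A quick sketch with made-up values, using sprintf, which builds the same string that printf would print:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format a date and a file name the way the script's printf does.
# (format_entry and the sample values are invented for this example.)
sub format_entry {
    my ($day, $month, $year, $fileName) = @_;
    return sprintf("... %02d/%02d/%d %s", $day, $month, $year, $fileName);
}

print format_entry(5, 3, 2007, "report.txt"), "\n";
# prints: ... 05/03/2007 report.txt
```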

If you don't know much about this sort of thing, it's worth trying to learn. Try man perlintro, or do what everyone else does: get a book, preferably choosing one whose style and price you like, start copying the examples and then modifying them. There is nothing like making a lot of mistakes to really help you learn something new, so start trying to enjoy making strange mistakes to see what happens!

perl has been around a long time, so there are enormous numbers of libraries freely available for it, and it is pretty efficient. For example, there are libraries that you can use to generate or read Excel spreadsheets without ever using a M*cr*s*ft product.

Of course, don't learn just perl. It's a good place to start learning some things that you will find reappearing in lots of other computer languages.

suneeladdala · 12-04-2007, 06:00 PM

Hi Ross..
It's pretty clear except for a single question, which I guess you forgot to answer.

It looks like you are finding the file date based on the time stamp, right? Shouldn't it come from the file name if the name itself contains the date, like abc_09282007.txt?

Does this process also get the correct files?

Thanks a lot for your suggestions. I will start learning more perl.

Regards

Dr_P_Ross · 12-05-2007, 04:07 AM

It looks like you are finding the file date based on the time stamp, right? Shouldn't it come from the file name if the name itself contains the date, like abc_09282007.txt?

The perl script does use the time stamp rather than the name. Extracting a date from a filename depends, of course, on the particular format of the name. Here is a hint:

Code:

  @parts = split(/([-.])/, "file-23-07-2007.txt");

splits up that name wherever a dash or dot appears and puts the chunks into the array, including the separators (so $parts[0] is "file", $parts[1] is "-" etc). Then $parts[2], $parts[4] and $parts[6] are the date ingredients.
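
For a name like abc_09282007.txt, a regular expression may be simpler than split. The sketch below assumes the digits are MMDDYYYY and sit between an underscore and the extension, which is only a guess from that one example name. It turns the date into a number of the form YYYYMMDD so that dates can be compared directly, without Date::Calc. The helper names date_key_from_name and wanted_files are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a name like abc_09282007.txt into the sortable number 20070928.
# Assumes an MMDDYYYY stamp between "_" and the extension; returns
# undef for names that do not fit that pattern.
sub date_key_from_name {
    my ($name) = @_;
    return unless $name =~ /_(\d{2})(\d{2})(\d{4})\.[^.]+$/;
    my ($month, $day, $year) = ($1, $2, $3);
    return $year * 10000 + $month * 100 + $day;
}

# Keep the names dated after $latest_key and before $today_key.
sub wanted_files {
    my ($latest_key, $today_key, @names) = @_;
    return grep {
        my $key = date_key_from_name($_);
        defined $key && $key > $latest_key && $key < $today_key;
    } @names;
}

# Made-up listing: latest archived file is 11/27/2007, today is 12/03/2007
my @names = qw(abc_11272007.txt abc_11282007.txt abc_12022007.txt
               abc_12032007.txt readme.txt);
print "$_\n" for wanted_files(20071127, 20071203, @names);
# prints abc_11282007.txt and abc_12022007.txt
```

The same date_key_from_name could be run over the names in the archive folder to find the latest processed date, instead of relying on the time stamps from ls.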

suneeladdala · 12-05-2007, 08:19 AM

Cool Ross,
Thanks a lot...much appreciated.

archtoad6 · 12-13-2007, 03:58 PM

What is wrong w/ the existing script?
Is it in some way broken?
Are you trying to fix it or perhaps make it more sophisticated?

Why the change in requirements?

Please post the old script.