how to scp files only after they've been there for X seconds

anon091 · 08-17-2011, 02:59 PM

Hi everybody. I have a program that outputs files to a certain directory, lets say /data/output

only problem is i need to copy those files to another server. What i'm hoping to do is scp them to that other server, then move them out of /data/output to /data/SentToOtherServer after they are scp'd. the only problem is that the files are huge so take a long time to write into /data/output and i dont want it grabbing the files before they are fully written. is there something i can do timestamp-wise so they aren't scp'd unless a timestamp on that file is older than say 30 seconds which would mean its finished writing?

unSpawn · 08-17-2011, 04:30 PM

How about instead using inotify watching files for "close_write" event?

anon091 · 08-17-2011, 04:38 PM

I'm not familiar with inotify, never even heard of it until I read your post. there is a man page for it on my server, but I cant say i really understand how to use it.

unSpawn · 08-18-2011, 12:25 AM

While you have asked no question try 'inotifywait --monitor --recursive --quiet --csv --event close_write /data/output' as an example.

anon091 · 08-18-2011, 11:15 AM

OK, looking at the man for inotify I kinda get what that would do. But how do i tie these commands into a .sh that would scp the file(s) then move them to another folder? I can write the scp command and the mv command, but dont know how to tie it all together with not grabbing files before they are written. could anyone provide an example?

anon091 · 08-18-2011, 11:34 AM

or is there something like cmin that works for seconds where you can use that to do a find with?

Nominal Animal · 08-18-2011, 04:21 PM

Quote:

Originally Posted by rjo98

OK, looking at the man for inotify I kinda get what that would do. But how do i tie these commands into a .sh that would scp the file(s) then move them to another folder? I can write the scp command and the mv command, but dont know how to tie it all together with not grabbing files before they are written. could anyone provide an example?

If you use inotifywait to detect close_write, you get the event after the writer has closed the file. There is no need for any additional waiting; the file has been closed (for writing) already, and should therefore be ready for copying.

inofitywait is part of the inotify-tools package, and has its own man page (after you install the package). The inotify man pages describe the API, whereas inotifywait is a shell command.

Try running this in the /data/output directory:

Code:

inotifywait -mrq -e close_write --format '%w%f' . | xargs -I COMPLETEDFILE scp COMPLETEDFILE user@remote:path/COMPLETEDFILE

It will transfer just one file at a time, but keeps the directory structure intact. If you want (unlimited) parallel SCP's, you could use

Code:

inotifywait -mrq -e close_write --format '%w%f' . | while read FILE ; do ( scp "$FILE" "user@remote:path/$FILE" & ) ; done

For a reliable service, you do need something a bit more complex. At startup, I'd check the names, sizes, and SHA1SUMs of all local files, and compare them to remote ones. You will need to buffer the inotifywait output somehow, to make sure you won't miss any events; it has a limited-size buffer, and will discard events if you don't process them fast enough.

You might wish to take a look at the incron package.

anon091 · 08-18-2011, 04:43 PM

Thanks for some examples Nominal. Combining commands always confuses me. couple questions though, if i would be doing this from a .sh file (i'm assuming) how do i force it to run in the /data/output directory all the time? I also need to move the file to another folder once its been SCP'd to that other server, but i dont think something like that is in the example, or is it?

Appreciate the example and help.

Nominal Animal · 08-18-2011, 09:40 PM

Quote:

Originally Posted by rjo98

Thanks for some examples Nominal. Combining commands always confuses me. couple questions though, if i would be doing this from a .sh file (i'm assuming) how do i force it to run in the /data/output directory all the time? I also need to move the file to another folder once its been SCP'd to that other server, but i dont think something like that is in the example, or is it?

In a shell script, use cd to change the current working directory, like you always do. It is process-specific (working directory is private to each process), so changing the process directory in one process does not change it in any other process.

Consider the following Bash script.

Code:

#!/bin/bash

# Directory watched for completed files.
# Subdirectories are not watched.
INCOMING=/data/output

# SCP target for files.
# Note: all files end up in this same directory.
# Password authentication will not work, you need
# to set up authentication keys.
REMOTE=user@remote:/directory/

# Directory scp'd files are moved to.
# Note: all files end up in this same directory.
COMPLETED=/data/sent

# Extra SCP options. Use blowfish, only try 5 secs to connect.
SCPOPTS=(-c blowfish -o ConnectTimeout=5)

# Paths are relative to INCOMING directory.
cd "$INCOMING" || exit $?

# Wait for completed files in the INCOMING directory,
inotifywait -mq -e close_write --format '%f' . | while read FILE ; do

        # Only consider normal files.
        [ -f "$FILE" ] || continue

        # Try to transfer the file(s) using SCP.
        if ! scp "${SCPOPTS[@]}" "$FILE" "$REMOTE" ; then
                printf '%s scp-failure %s\n' "$(date '+%Y-%m-%d %T %z')" "$FILE"
                continue
        fi

        # SCP was successful. Move the file. May overwrite an old one.
        if ! mv -f "$FILE" "$COMPLETED" ; then
                printf '%s mv-failure %s\n' "$(date '+%Y-%m-%d %T %z')" "$FILE"
                continue
        fi

        # Success.
        printf '%s success %s\n' "$(date '+%Y-%m-%d %T %z')" "$FILE"
done

It will output a list of files (closed after being open for writing). The first (three) fields will contain the date, time, and timezone (numeric). The fourth field will contain 'success', 'scp-failure', or 'mv-failure'. The fifth field will contain the file name.

The script will never exit by itself; you need to kill it via e.g.

Code:

kill -HUP $(ps -C inotifywait -o pid=)

but if you have more than one running, that will kill all of them.

It is quite possible to extend the above around some job or script, so that close_write events are only watched while the other job/script runs, and afterwards everything is cleaned up -- including scp'ing and copying any files the monitoring might have missed. That will make the script even more complicated, though. You should also consider what to do with errors, for example if you run out of disk space. Should you just output the error, or should you send an e-mail message? Note that inotifytools package is not installed by default on most Linux distributions. If you are a Linux cluster user, first contact your cluster admins to ask if inotifytools is installed, and if/which command-line utility you can use to send mail from compute nodes. E-mail is not always possible from compute nodes, or may only be possible via a specific command-line client, e.g. /bin/sendmail.

anon091 · 08-19-2011, 09:34 AM

wow, that pretty intense, and impressive! I never would have figured any of that out haha. do those printf's just put stuff up on the screen? I'm guess I already have the inotifytools installed because i was able to pull up man pages for the stuff.

So since i would cron this, I really dont need the printf's if they just write to the screen, since nobody would see them as this would run constantly in the background?

Guess there's a lot more to think about then what i posted simply in my original post!!

anon091 · 08-19-2011, 09:42 AM

My original line of thought was to somehow do a find like

find /data/output/* -type f -cmin +1

even though i'd really like to do it like right after the file is closed or a few seconds after, kinda like what this inotify stuff does. then do the scp command, then move the file to /data/sent.

I guess that's kinda simplistic and doesnt account for errors, and its very glamourous. Plus i have no idea how to combine then all to work right. Just figured i'd give more background.

Nominal Animal · 08-20-2011, 11:12 AM

Quote:

Originally Posted by rjo98

do those printf's just put stuff up on the screen?

Yes, they are there only as informative output; you can just as well remove it altogether.

Quote:

Originally Posted by rjo98

So since i would cron this

Well, I wouldn't. Just remove the printfs, and let it run all the time. The script does not use that much RAM, and it only uses CPU time when something happens. (It does not busy-wait; it blocks/sleeps on waiting for input when there is nothing to do.)

You might add another script to cron, to do the same for files that have not been modified in the last N minutes (say, a few hours), so that you "catch" anything the monitoring missed, or could not transfer for some reason. Basically,

Code:

#!/bin/bash
cd /data/output || exit $?
find . -maxdepth 1 -type f -mmin +N -print0 | while read -d "" FILE ; do
        scp "$FILE" "user@remote:path/" || continue
        mv -f "$FILE" /data/completed
done

Note that it may be necessary to add a running flag (say /var/run/scp-backup.pid containing the PID of the running process), and check if another copy of the same script is running (still alive), if the transfers may take longer than your cron interval is. Otherwise cron may start another copy of the script while the old one is still running.

My personal approach to issues like this is much more careful than most. I tend to assume problems will occur, and try to handle them in an useful manner. Your initial idea might work well for you, without any issues, if you happen to select a large enough age limit. My environments tend to vary too much for a simple age limit to work reliably, so I've had to find more reliable methods. They are obviously a bit more complex, but I think their robustness more than makes up for the added complexity.

anon091 · 08-21-2011, 07:28 PM

Thanks for replying Nominal. I agree you're approach is probably better suited than my very basic idea from the get go, was just posting that to show my thought process.
I'm not sure how to let it run all the time in the background though. Also, if the server is restarted, would however you set that up automatically restart it as well so it would start doing the process?
I'm also confused by your "note that it..." as i'm not sure how you use a pid file and i thought you said not to cron it (even though I dont know how to make it run all the time like you said".

Sorry for all these questions, but i appreciate you answering them all.

anon091 · 08-22-2011, 01:53 PM

I'm just afraid this approach may be too far over my head, and i wouldnt be able to support it. but maybe after you answer those questions i'll understand it better. thanks again.

szboardstretcher · 08-22-2011, 01:58 PM

Here is a simple solution...

Quote:

Originally Posted by rjo98

Hi everybody. I have a program that outputs files to a certain directory, lets say /data/output

only problem is i need to copy those files to another server. What i'm hoping to do is scp them to that other server, then move them out of /data/output to /data/SentToOtherServer after they are scp'd. the only problem is that the files are huge so take a long time to write into /data/output and i dont want it grabbing the files before they are fully written. is there something i can do timestamp-wise so they aren't scp'd unless a timestamp on that file is older than say 30 seconds which would mean its finished writing?

Well,.. if you have a script that outputs the files, you could always use the && operator... which will run a command *only* after the previous one completes. So:

Code:

./write_files_to_directory.sh && scp * 192.168.1.1:/tmp/

Would write the files to the directory,.. then when complete, SCP them to 192.168.1.1:/tmp/