Linux script Limiting Zip archive size

giantanakim · 11-03-2011, 07:20 AM

Hi Everyone:

I am trying to zip a large number of wav files through a script. However, this zip file needs to be uploaded to a business partner's FTP server, which limits files to 2 GB max. They have things set up so that it has to be a zip file specifically.

Is there a way to script this: take the contents of a folder and zip all the files into one zip file, unless it is more than 2 GB, in which case break it into two files, or more if necessary? I cannot use split archives for this, as each file needs to be standalone.

Thanks everyone

MensaWater · 11-03-2011, 09:28 AM

zipsplit splits a zip file into multiple zip files and you can specify maximum size with the -n option.

http://wiki.linuxquestions.org/wiki/Zipsplit

giantanakim · 11-03-2011, 10:23 AM

I looked into zipsplit. It appears that zipsplit creates files which are dependent on each other (i.e. file1.ro1, file2.ro2, etc.), where all of the pieces are required to be present before the file can be unzipped.

Am I incorrect on this? The problem with that, if it is true, is that their system unzips a file as soon as it has finished uploading. At that point, the next file upload overwrites the prior file, as they are all required to have the same name.

If I am confused, please clear me up!

MensaWater · 11-03-2011, 11:18 AM

No - the zip files created by zipsplit are independent of each other.

As a test I created a zip file of a directory where I keep various scripts and binaries for personal use:

Code:

zip bin.zip *

That created the file bin.zip. I then ran unzip to a different directory to make sure that had all the files.

I then ran the following to break up the single zip into multiples:

Code:

zipsplit -n 23000 bin.zip

That output:
4 zip files will be made (100% efficiency)
creating: bin1.zip
creating: bin2.zip
creating: bin3.zip
creating: bin4.zip

I then did an scp of bin2.zip to another server and ran unzip there. This extracted the files in that zip file without asking about any of the others. I then did scp of bin4.zip and unzipped it with the same results. I then did it for bin1.zip then for bin3.zip and in all four cases the files unzipped cleanly without asking for any of the others. (As noted when I did bin2.zip there were no others on the system.)

Nominal Animal · 11-03-2011, 11:34 AM

The man page for zipsplit says it does not support files over 2G in size.

Quote:

Originally Posted by giantanakim

Is there a way to script to take the contents of a folder, zip all the files into one zip file unless it is more than 2 gigs, in which case break it into 2 files, or more if necessary.

zipsplit seems to do just that to a zip archive, but I'm not sure it works for such large archives.

You can pre-group your wav files beforehand, and then zip them separately. They are unlikely to compress much, so grouping each 1800 M to 2 G of wav files to be zipped in one archive should work.

Here is a simple Bash script you might start with:

Code:

#!/bin/bash
if [ $# -lt 3 -o "$1" == "-h" -o "$1" == "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [ -h | --help ]"
    echo "       $0 name maxsize folder(s)..."
    echo ""
    echo "This program creates name.zip containing all files in the"
    echo "specified folders, unless their combined size exceeds maxsize."
    echo ""
    echo "When the combined file sizes exceed maxsize, this program"
    echo "will create name-1.zip, name-2.zip, and so on, with each"
    echo "archive containing at least one file, but archive contents"
    echo "not exceeding the maxsize. Each archive will contain full files."
    echo ""
    echo "Files are not reordered or sorted, so archive sizes may"
    echo "fluctuate wildly."
    echo ""
    exit 0
fi

# Base name of the zip archive to create
BASENAME="$1"
if [ -z "$BASENAME" ]; then
    echo "Empty zip archive name!" >&2
    exit 1
fi

# Maximum size for input files for each archive
case "$2" in
    *k|*K)           MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1000 ] || exit $? ;;
    *kb|*kB|*Kb|*KB) MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1024 ] || exit $? ;;
    *m|*M)           MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1000000 ] || exit $? ;;
    *mb|*mB|*Mb|*MB) MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1048576 ] || exit $? ;;
    *g|*G)           MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1000000000 ] || exit $? ;;
    *gb|*gB|*Gb|*GB) MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1073741824 ] || exit $? ;;
    *[^0-9]*)        echo "$2: Invalid maximum size." >&2
                     exit 1 ;;
    *)               MAXTOTAL=$[ $2 ] || exit $? ;;
esac
shift 2

find "$@" -type f -print0 | (

    # Current index, list of files, and total size of files
    INDEX=0
    FILES=()
    TOTAL=0

    while IFS= read -r -d "" FILE ; do
        SIZE=`stat -c %s "$FILE"` || continue

        NEWTOTAL=$[ SIZE + TOTAL ]
        if [ ${#FILES[@]} -lt 1 ] || [ $NEWTOTAL -le $MAXTOTAL ]; then
            FILES=("${FILES[@]}" "$FILE")
            TOTAL=$NEWTOTAL
            continue
        fi

        INDEX=$[ INDEX + 1 ]
        zip "$BASENAME-$INDEX.zip" "${FILES[@]}" || exit $?

        FILES=("$FILE")
        TOTAL=$SIZE
    done

    if [ ${#FILES[@]} -gt 0 ]; then
        if [ $INDEX -gt 0 ]; then
            INDEX=$[ INDEX + 1 ]
            zip "$BASENAME-$INDEX.zip" "${FILES[@]}" || exit $?
        else
            zip "$BASENAME.zip" "${FILES[@]}" || exit $?
        fi
    elif [ $INDEX -eq 0 ]; then
        echo "No files to zip specified." >&2
        exit 0
    fi

    echo "" >&2
    if [ $INDEX -gt 0 ]; then
        echo "Created $INDEX files:" >&2
        for I in `seq 1 $INDEX` ; do
            echo "    $BASENAME-$I.zip ($(stat -c %s "$BASENAME-$I.zip") bytes)" >&2
        done
    else
        echo "Created 1 file:" >&2
        echo "    $BASENAME.zip ($(stat -c %s "$BASENAME.zip") bytes)" >&2
    fi
    echo "" >&2
    exit 0
)
exit $?

Run it without arguments to get usage. You can use k/M/G suffixes for the maximum size (decimal kilo, mega, and giga respectively; the kB/MB/GB forms use powers of 1024 instead), but the number must be an integer; no decimals!

This script will also never use the Zip64 extensions for large archives, even when the directory contents are several gigabytes (at least if the size limit is < 2G). The resulting archives should be unpackable with even old PKZIP programs. (If I remember correctly, the old ones may choke with zipsplit archives.)

If you have wildly different file sizes, you should consider writing an awk script, which uses one of the greedy algorithms for solving the bin packing problem (to efficiently decide which file should go in which zip archive). A similar problem is encountered when backing up large directories to write-once media (DVD-R discs, for example).
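
A rough sketch of one such greedy approach (first-fit decreasing), written in Bash rather than awk to match the script above. The function name pack_first_fit and the "size, tab, name" input format are assumptions made for this illustration, and file names containing tabs or newlines are not handled:

```bash
#!/bin/bash
# Hypothetical helper: first-fit decreasing bin packing.
# Reads "size<TAB>name" lines on standard input and prints
# "bin<TAB>name" lines. $1 is the per-archive limit in bytes.
# A file larger than the limit ends up alone in its own bin.
pack_first_fit() {
    local limit=$1 size name i placed
    local -a used=()
    # Largest files first, then drop each into the first bin with room.
    sort -t $'\t' -k1,1nr | while IFS=$'\t' read -r size name; do
        placed=0
        for i in "${!used[@]}"; do
            if (( used[i] + size <= limit )); then
                used[i]=$(( used[i] + size ))
                placed=1
                break
            fi
        done
        if (( ! placed )); then
            i=${#used[@]}          # open a new bin
            used[i]=$size
        fi
        printf '%d\t%s\n' "$(( i + 1 ))" "$name"
    done
}
```

With GNU find, something like find . -type f -printf '%s\t%p\n' | pack_first_fit 2000000000 would list which archive number each file could go into; the actual zip calls are left out of this sketch.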

If you need each zip file to be as close to the limit as possible, you could use the zip -O option to add each file to the archive without overwriting the old one; if the result is smaller than the limit, then try adding the next file to it. However, since in the worst case you'd copy almost 2G (the archive size before adding the current file -- remember, it must keep the old archive intact in case the limit is exceeded) each time you add one file, it would be quite slow at times.

I hope this gets you started,

MensaWater · 11-03-2011, 12:31 PM

The man page is cryptic on this. The project site's FAQ seems to suggest the limitation is on files within the archive. That is to say, I'm not sure if its limitation is on zip files larger than 2 GB, on files contained within the zip larger than 2 GB, or both.

It can't hurt to try it if you don't have any files larger than 2 GB you are zipping up or if the zip is less than 2 GB anyway.

I'll have to admit I'm surprised by such a limitation on the zip file size itself. It seems the most likely users of zipsplit would be those with large zip files and these days 2 GB is nothing.

giantanakim · 11-03-2011, 03:12 PM

Ok, so I am attempting to use zipsplit. However, every time I try this command:

Quote:

zipsplit -n 23000 bin.zip

It gives me the following response:

zipsplit warning: Entry is larger than max split size of :
zipsplit warning: use -n to set split size
zipsplit error: Entry too big to split, read, or write

I have entered multiple sizes, and nothing has worked. In addition, I tried removing the -n 23000 from the command, and still received the same error. This is a 43MB test file.

Any suggestions?

Cedrik · 11-03-2011, 04:05 PM

Why not use the nice looking script posted above by Nominal Animal ?

MensaWater · 11-03-2011, 04:50 PM

bin.zip was an arbitrary name I chose for my file because it was zip of a bin directory.

The errors are essentially blank presumably because it can not find a file named bin.zip. Odd though - I'd have thought they'd tell you file not found rather than do all that.

You need to run the command on the zip filename YOU created. Since you seem to have created it frequently prior to posting I'm assuming it is NOT named bin.zip (or it is one hell of a coincidence if it is).

The generic syntax for what I did is:

zipsplit -n <size> <filename>

Where you substitute the size in bytes for <size> and the filename of YOUR zip file for <filename>.

Nominal Animal · 11-04-2011, 04:54 AM

Quote:

Originally Posted by giantanakim

zipsplit warning: use -n to set split size
zipsplit error: Entry too big to split, read, or write

If you do not specify -n size it uses a default. You can see the default if you run zipsplit without arguments; in my case, it is 36000 bytes.

Quote:

Originally Posted by giantanakim

zipsplit error: Entry too big to split, read, or write

This means that either one of the split zip files would have to be larger than the maximum size specified (typically because there is at least one file that when compressed alone, is larger than the maximum size specified), or that the archive (or a single file) is larger than the 2G limit for zipsplit.

The limitations for zipsplit are due to the fact that it is a very old 32-bit format, and you can only describe lengths of up to 2147483647 bytes exactly using a signed 32-bit integer. (Where unsigned integers are used, the limit is of course 4G, or 4294967295 bytes.)

To overcome the limitations, Zip64 extensions were developed, using 64 bit integers (theoretically, 9223372036854775807 byte limit for signed 64-bit integers). However, zipsplit does not support those extensions. (I suspect there may be a technical reason, since the man page indicates zipsplit uses different extensions for splitting the zip archive; if those extensions are not 64-bit, then you cannot support 64-bit archives. It seems that PKZIP has only relatively recently grown support for zipsplit extensions, too.)
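
The three limits quoted above can be reproduced with shell arithmetic. This is only a quick check, and it assumes a Bash build with 64-bit signed arithmetic (the usual case); the variable names are made up for the example:

```bash
#!/bin/bash
# Largest values that fit in each integer type (2^31-1, 2^32-1, 2^63-1).
max_s32=$(( (1 << 31) - 1 ))
max_u32=$(( (1 << 32) - 1 ))
# 2^63 itself would overflow Bash's signed 64-bit arithmetic,
# so build 2^63-1 from two halves instead.
max_s64=$(( (1 << 62) - 1 + (1 << 62) ))

echo "signed 32-bit:   $max_s32"
echo "unsigned 32-bit: $max_u32"
echo "signed 64-bit:   $max_s64"
```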

Quote:

Originally Posted by giantanakim

Any suggestions?

Try my script, dangit, or at least tell me why it does not suit your needs. It works for me.

giantanakim · 11-07-2011, 12:28 PM

@Mensawater:

I was not actually using bin.zip in the actual command, but did not want to post the specific name of the file. It was still giving the error. I have done some reading, and it appears that the newer versions of Ubuntu have a bug regarding zipsplit.

@Nominal Animal

I would love to use your script. I am trying to understand it, as it will be connecting to other scripts in sequence. It is a bit more complex than I am used to at this point, so any help that you could provide in understanding how it works would be great.

Nominal Animal · 11-07-2011, 07:12 PM

Here's the script I listed above explained part by part.

If the script is called with fewer than three arguments, or the first argument is -h or --help, the script outputs some usage information to standard error:

Code:

#!/bin/bash
if [ $# -lt 3 -o "$1" == "-h" -o "$1" == "--help" ]; then
    exec >&2
    echo ""
    echo "Usage: $0 [ -h | --help ]"
    echo "       $0 name maxsize folder(s)..."
    echo ""
    echo "This program creates name.zip containing all files in the"
    echo "specified folders, unless their combined size exceeds maxsize."
    echo ""
    echo "When the combined file sizes exceed maxsize, this program"
    echo "will create name-1.zip, name-2.zip, and so on, with each"
    echo "archive containing at least one file, but archive contents"
    echo "not exceeding the maxsize. Each archive will contain full files."
    echo ""
    echo "Files are not reordered or sorted, so archive sizes may"
    echo "fluctuate wildly."
    echo ""
    exit 0
fi

The first parameter specifies the name of the ZIP archive to create (sans .zip suffix).

Code:

# Base name of the zip archive to create
BASENAME="$1"
if [ -z "$BASENAME" ]; then
    echo "Empty zip archive name!" >&2
    exit 1
fi

Second command line parameter specifies the maximum total size for files in a single archive. ${2//[^0-9]/} evaluates to only the digits in the second parameter. If the second parameter is foo.2-bar/4 then ${2//[^0-9]/} evaluates to 24 . A case statement is used to catch the known multipliers, so that MAXTOTAL will end up being the actual limit in bytes. If the conversion calculation fails, the script aborts. (Bash will issue an error message describing the problem.)

Code:

# Maximum size for input files for each archive
case "$2" in
    *k|*K)           MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1000 ] || exit $? ;;
    *kb|*kB|*Kb|*KB) MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1024 ] || exit $? ;;
    *m|*M)           MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1000000 ] || exit $? ;;
    *mb|*mB|*Mb|*MB) MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1048576 ] || exit $? ;;
    *g|*G)           MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1000000000 ] || exit $? ;;
    *gb|*gB|*Gb|*GB) MAXTOTAL=$[ (${2//[^0-9]/} -0) * 1073741824 ] || exit $? ;;
    *[^0-9]*)        echo "$2: Invalid maximum size." >&2
                     exit 1 ;;
    *)               MAXTOTAL=$[ $2 ] || exit $? ;;
esac
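
To try the digit stripping and the multipliers in isolation, here is a small sketch of the same conversion. The function name to_bytes is made up for this example, and it uses the $(( )) arithmetic form rather than the older $[ ] form:

```bash
#!/bin/bash
# Hypothetical helper mirroring the case statement above:
# convert a size such as 2G, 1800M or 512kb into bytes.
to_bytes() {
    local digits=${1//[^0-9]/}       # keep only the digits of the argument
    case "$1" in
        *k|*K)           echo $(( digits * 1000 )) ;;
        *kb|*kB|*Kb|*KB) echo $(( digits * 1024 )) ;;
        *m|*M)           echo $(( digits * 1000000 )) ;;
        *mb|*mB|*Mb|*MB) echo $(( digits * 1048576 )) ;;
        *g|*G)           echo $(( digits * 1000000000 )) ;;
        *gb|*gB|*Gb|*GB) echo $(( digits * 1073741824 )) ;;
        *[^0-9]*)        echo "$1: Invalid maximum size." >&2
                         return 1 ;;
        *)               echo $(( digits + 0 )) ;;
    esac
}

to_bytes 2G      # decimal giga
to_bytes 2GB     # binary giga
```

The two calls show why 2G and 2GB are not the same limit: the first is 2000000000 bytes, the second 2147483648 bytes.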

Since the first two parameters have been taken care of, we remove them. The third command line parameter will become the first, the fourth the second, and so on.

Code:

shift 2
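
A tiny illustration of what shift 2 leaves behind; the function name rest_after_two and the sample arguments are made up for the demo:

```bash
#!/bin/bash
# Hypothetical demo: drop the first two arguments, print the rest
# one per line.
rest_after_two() {
    shift 2
    printf '%s\n' "$@"
}

rest_after_two Recordings 2G ./monday ./tuesday
```

Only ./monday and ./tuesday are printed, which is exactly the list that find receives as "$@" in the script.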

To find all the files to archive, we use a find command, using the command line parameters as arguments to it (except the two parameters we handled above already). File names will be separated by ASCII NULs. The list is supplied as input to a subshell. (You can just think of it as a separate scope. Changes to e.g. variables are not propagated outside the subshell.)

Code:

find "$@" -type f -print0 | (

    # Current index, list of files, and total size of files
    INDEX=0
    FILES=()
    TOTAL=0
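
The scope remark can be checked with a minimal sketch like this one (scope_demo is a hypothetical name, and printf stands in for find):

```bash
#!/bin/bash
# Hypothetical demo: changes made inside the piped subshell do not
# reach the calling shell.
scope_demo() {
    local COUNT=0 NAME
    printf 'a\0b c\0' | (
        while IFS= read -r -d "" NAME; do
            COUNT=$(( COUNT + 1 ))
        done
        echo "inside: $COUNT"
    )
    echo "outside: $COUNT"
}

scope_demo
```

The subshell counts two names, but the caller still sees COUNT as 0. This is why the script does all of its work, including the final summary, inside the parentheses.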

Each file name is read in a loop, and the size of that file put into variable SIZE. If the file does not exist or is otherwise unreadable, it is skipped.

Code:

    while IFS= read -r -d "" FILE ; do
        SIZE=`stat -c %s "$FILE"` || continue
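
As an aside, NUL-separated reading can be tried without find. In this sketch (collect_names and the sample names are assumptions for the demo), IFS= and -r are used so that leading spaces and backslashes in names are kept as they are:

```bash
#!/bin/bash
# Hypothetical demo: read NUL-separated names into the array NAMES.
collect_names() {
    local name
    NAMES=()
    while IFS= read -r -d "" name; do
        NAMES+=("$name")
    done
}

# Names with a space and even a newline survive intact.
collect_names < <(printf '%s\0' "plain.wav" "two words.wav" $'line\nbreak.wav')
echo "${#NAMES[@]} names read"
```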

TOTAL is the current running total. Calculate the new total, if the file were to be included into the current zip archive.

Code:

        NEWTOTAL=$[ SIZE + TOTAL ]

If the current archive is empty, or the new total does not exceed the maximum size limit, add the current file to the list of files to be archived in the current zip archive, and continue on to the next file.

Code:

        if [ ${#FILES[@]} -lt 1 ] || [ $NEWTOTAL -le $MAXTOTAL ]; then
            FILES=("${FILES[@]}" "$FILE")
            TOTAL=$NEWTOTAL
            continue
        fi

At this point we know that the current file cannot be added to the existing zip archive. Since we already know the files in the current zip archive, let's create the zip archive now. Note that INDEX was initially zero. We increase it first, so that we get -1 for the first archive, -2 for the second, and so on. If the zip command fails, abort the script.

Code:

        INDEX=$[ INDEX + 1 ]
        zip "$BASENAME-$INDEX.zip" "${FILES[@]}" || exit $?

We have taken care of all the previous files. The current file is the first file in a new archive; start a new list (and running total size) for the new archive.

Code:

        FILES=("$FILE")
        TOTAL=$SIZE

The file handling loop is complete:

Code:

    done
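
The grouping rule of the loop can be tried on plain numbers instead of files. This is a sketch only: group_sizes is a hypothetical helper, and it prints the sizes of each group on one line instead of calling zip:

```bash
#!/bin/bash
# Hypothetical helper: apply the same "start a new archive when the
# next file would not fit" rule to a list of sizes.
# $1 is the limit; the remaining arguments are file sizes.
group_sizes() {
    local limit=$1; shift
    local total=0 size
    local -a group=()
    for size in "$@"; do
        # An empty group always accepts the file, even an oversized one.
        if (( ${#group[@]} < 1 || total + size <= limit )); then
            group+=("$size")
            total=$(( total + size ))
            continue
        fi
        echo "${group[*]}"       # "create the archive"
        group=("$size")
        total=$size
    done
    if (( ${#group[@]} > 0 )); then echo "${group[*]}"; fi
}

group_sizes 10 4 5 3 12 1
```

With a limit of 10 this prints four groups: "4 5", "3", "12" and "1". Files keep their order, so groups can end up well under the limit, and a single file larger than the limit still gets an archive of its own.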

If there are files yet to be archived, create a new archive name. If we have not created any archives at all yet, there is no need to use the -1 in the name. Use the same zip command as above to create the archive in either case.

Code:

    if [ ${#FILES[@]} -gt 0 ]; then
        if [ $INDEX -gt 0 ]; then
            INDEX=$[ INDEX + 1 ]
            zip "$BASENAME-$INDEX.zip" "${FILES[@]}" || exit $?
        else
            zip "$BASENAME.zip" "${FILES[@]}" || exit $?
        fi

Otherwise, if INDEX is still zero, there were no files to archive at all. Abort the script if no archive was created.

Code:

    elif [ $INDEX -eq 0 ]; then
        echo "No files to zip specified." >&2
        exit 0
    fi

I like to tell the user explicitly what the script does. If INDEX is nonzero, we have created archives 1 through INDEX:

Code:

    echo "" >&2
    if [ $INDEX -gt 0 ]; then
        echo "Created $INDEX files:" >&2
        for I in `seq 1 $INDEX` ; do
            echo "    $BASENAME-$I.zip ($(stat -c %s "$BASENAME-$I.zip") bytes)" >&2
        done

otherwise we have created just the one (versionless) archive.

Code:

    else
        echo "Created 1 file:" >&2
        echo "    $BASENAME.zip ($(stat -c %s "$BASENAME.zip") bytes)" >&2
    fi
    echo "" >&2
    exit 0

Note that above, stat -c %s filename outputs the size of filename in bytes. Because we want to insert the output of that command into the echoed line, we enclose it in $(...). The >&2 redirects the output of each echo to standard error instead of standard output. It is not important, just something I like to do: it is nice to be able to pipe the output to another command, but still see the summary on-screen when running the script (since standard error is usually directed to the terminal, even if standard output is piped to another command).
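
A small sketch of that split between the two streams; the function name report and its messages are invented for the demo:

```bash
#!/bin/bash
# Hypothetical demo: data goes to standard output, the human-readable
# summary goes to standard error.
report() {
    echo "archive-1.zip"
    echo "Created 1 file" >&2
}

# Only standard output is captured; the summary still reaches the terminal.
captured=$(report)
echo "captured: $captured"
```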

Use the subshell exit status for the entire script. If the subshell succeeded, the script will also return success.

Code:

)
exit $?
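
How the status travels out of the subshell can be seen in a short sketch (run_demo and the status value 3 are arbitrary choices for the demo). The status of a pipeline is that of its last command, which here is the subshell:

```bash
#!/bin/bash
# Hypothetical demo: an exit inside the piped subshell becomes the
# status of the whole pipeline.
run_demo() {
    echo "input" | (
        read -r line
        [ "$line" = "expected" ] || exit 3
        exit 0
    )
    return $?
}

run_demo || echo "subshell exited with status $?"
```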

If there are some specific things you wish me to clarify, just let me know.

giantanakim · 11-08-2011, 01:16 PM

Wow! I really appreciate the clarification on that Nominal Animal! The script works flawlessly. I have saved it into a script file. Can I just call it from another script with the line ./nominalanimal.sh Recordings 2G ./ ? That line should provide all needed properties correct?

Don't answer that, I will figure it out myself!

Thanks again for all of your help.

giantanakim · 11-09-2011, 08:06 AM

Ok, a couple things just came up.

1. I ran this script on 5g worth of files. It split this up into three different zip files. -1 was 1.3G, -2 was 800MB, and -3 was 2.39GB. I am not sure why it did that. Any thoughts?

2. Can I run this for another directory? For instance, if I put in ./nominalanimal.sh /home/foo/bar/recordings will it create the zip files in the specified directory? And will it zip the files contained within that directory?

3. Can this script be modified so that once a file is added to the zip, the original can be moved to a different directory?

michaelk · 11-09-2011, 08:27 AM

It depends on the type of files. As stated, binary files like audio will not compress much. Since zip files contain additional information, when compressing many binary files the actual result might be bigger than the original source. And since the final zip file size is a bit of an unknown, you will need to adjust the max size parameter.
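
One way to find out whether the parameter needs adjusting is to check each finished archive against the upload limit. A minimal sketch, assuming a hypothetical helper named fits_limit; it uses wc -c for the size, though stat -c %s as in the script above would work as well:

```bash
#!/bin/bash
# Hypothetical helper: succeed if the file is no larger than the
# given limit in bytes. Returns 2 if the size cannot be read.
fits_limit() {
    local file=$1 limit=$2 size
    size=$(wc -c < "$file") || return 2
    (( size <= limit ))
}
```

After a run, a loop over the created archives could call fits_limit "$archive" 2147483647 and report any archive that fails, in which case the maxsize argument can be lowered and the script run again.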