Linux - Software
08-22-2015, 09:18 PM | #1 | ramsforums | Member | Registered: Jun 2007 | Posts: 66
Deleting a Directory with millions of files and subdirectories
Greetings, everyone.
My laptop was reporting low storage space, so I installed BleachBit and ran it with root privileges with all options enabled, including wipe. It ran overnight for 12 hours and the application became unresponsive, so I held the power button for 10 seconds to power the laptop down.
Upon booting I found a directory named T6o1L9lGg- containing a huge number of files. I am unable to list it using the ls command.
I tried to delete it using
sudo rm -r T6o1L9lGg-
but that did not work; it runs forever.
I then tried the following commands:
mkdir empty
sudo rsync -a --delete ./empty/ ./T6o1L9IGg-/
That did not work either; it ran for over two hours and kept running.
Searching around, I came across the following thread:
http://www.stevekamerman.com/2008/03...#comment-16588
in which a David Villegas posted:
Code:
David Villegas
September 1st, 2013 at 1:03 am
After a nightmare on a server with free space but out of inodes, because of spam attack on a misconfigured postfix… i found this blog post!, in my case was 8 million files on /var/spool/postfix/maildrop folder.
The only useful formula that works on my case was the Mikhus script on php.
find, rsync, rm fail with that ammount of files
but i wrote a similar c++ script, to speed up the delete.
HOW:
just copy and paste this code to /var/spool/postfix/mydel.c
and compile with g++ -o mydelexec mydel.c
CODE:
//COMPILE WITH: g++ -o mydelexe mydel.c
#include
#include
int main() {
struct dirent *d;
DIR *dir;
char buf[256];
int i;
printf(“***mydelexe***\n”);
dir = opendir(“maildrop”);
while( d = readdir(dir) ) {
i++;
if(i%100==0) {
printf(“%s %i deleted!\n”,d->d_name,i);
}
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
remove(buf);
}
return 0;
}
This c do the same of the php, but a bit faster… for my 8 million files gone after few hours
When I tried to compile it, I got the following errors:
Code:
rama@develop ~ $ g++ -o mydelexec mydel.c
mydel.c:2:9: error: #include expects "FILENAME" or <FILENAME>
#include
^
mydel.c:3:9: error: #include expects "FILENAME" or <FILENAME>
#include
^
mydel.c:13:2: error: stray ‘\342’ in program
printf(“***mydelexe***\n”);
^
mydel.c:13:2: error: stray ‘\200’ in program
mydel.c:13:2: error: stray ‘\234’ in program
mydel.c:13:2: error: stray ‘\’ in program
mydel.c:13:2: error: stray ‘\342’ in program
mydel.c:13:2: error: stray ‘\200’ in program
mydel.c:13:2: error: stray ‘\235’ in program
mydel.c:14:2: error: stray ‘\342’ in program
dir = opendir(“maildrop”);
^
mydel.c:14:2: error: stray ‘\200’ in program
mydel.c:14:2: error: stray ‘\234’ in program
mydel.c:14:2: error: stray ‘\342’ in program
mydel.c:14:2: error: stray ‘\200’ in program
mydel.c:14:2: error: stray ‘\235’ in program
mydel.c:19:4: error: stray ‘\342’ in program
printf(“%s %i deleted!\n”,d->d_name,i);
^
mydel.c:19:4: error: stray ‘\200’ in program
mydel.c:19:4: error: stray ‘\234’ in program
mydel.c:19:4: error: stray ‘\’ in program
mydel.c:19:4: error: stray ‘\342’ in program
mydel.c:19:4: error: stray ‘\200’ in program
mydel.c:19:4: error: stray ‘\235’ in program
mydel.c:21:2: error: stray ‘\342’ in program
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
^
mydel.c:21:2: error: stray ‘\200’ in program
mydel.c:21:2: error: stray ‘\234’ in program
mydel.c:21:2: error: stray ‘\342’ in program
mydel.c:21:2: error: stray ‘\200’ in program
mydel.c:21:2: error: stray ‘\235’ in program
mydel.c:21:2: error: stray ‘\342’ in program
mydel.c:21:2: error: stray ‘\200’ in program
mydel.c:21:2: error: stray ‘\234’ in program
mydel.c:21:2: error: stray ‘\342’ in program
mydel.c:21:2: error: stray ‘\200’ in program
mydel.c:21:2: error: stray ‘\235’ in program
mydel.c: In function ‘int main()’:
mydel.c:9:1: error: ‘DIR’ was not declared in this scope
DIR *dir;
^
mydel.c:9:6: error: ‘dir’ was not declared in this scope
DIR *dir;
^
mydel.c:13:15: error: ‘mydelexe’ was not declared in this scope
printf(“***mydelexe***\n”);
^
mydel.c:13:27: error: ‘n’ was not declared in this scope
printf(“***mydelexe***\n”);
^
mydel.c:13:31: error: ‘printf’ was not declared in this scope
printf(“***mydelexe***\n”);
^
mydel.c:14:19: error: ‘maildrop’ was not declared in this scope
dir = opendir(“maildrop”);
^
mydel.c:14:30: error: ‘opendir’ was not declared in this scope
dir = opendir(“maildrop”);
^
mydel.c:16:24: error: ‘readdir’ was not declared in this scope
while( d = readdir(dir) ) {
^
mydel.c:19:14: error: expected primary-expression before ‘%’ token
printf(“%s %i deleted!\n”,d->d_name,i);
^
mydel.c:19:15: error: ‘s’ was not declared in this scope
printf(“%s %i deleted!\n”,d->d_name,i);
^
mydel.c:19:35: error: invalid use of incomplete type ‘struct main()::dirent’
printf(“%s %i deleted!\n”,d->d_name,i);
^
mydel.c:8:8: error: forward declaration of ‘struct main()::dirent’
struct dirent *d;
^
mydel.c:21:18: error: expected primary-expression before ‘%’ token
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
^
mydel.c:21:19: error: ‘s’ was not declared in this scope
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
^
mydel.c:21:21: error: expected primary-expression before ‘%’ token
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
^
mydel.c:21:45: error: invalid use of incomplete type ‘struct main()::dirent’
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
^
mydel.c:8:8: error: forward declaration of ‘struct main()::dirent’
struct dirent *d;
^
mydel.c:21:53: error: ‘sprintf’ was not declared in this scope
sprintf(buf, “%s/%s”, “maildrop”, d->d_name);
^
mydel.c:22:12: error: ‘remove’ was not declared in this scope
remove(buf);
^
I changed the #include lines to #include <studio.h> etc., but it still does not work.
1) How do I delete the directory?
2) How do I make the code work?
Thanks
08-22-2015, 11:59 PM | #2 | norobro | Member | Registered: Feb 2006 | Distribution: Debian Sid | Posts: 792
Quote:
Originally Posted by ramsforums
I changed the #include lines to #include <studio.h> etc., but it still does not work.
Did you mean <stdio.h> ?
You also need to include two additional headers (from man opendir or man readdir):
Code:
#include <sys/types.h>
#include <dirent.h>
The "error: stray ..." errors are caused by unicode characters copied from your browser. Notice the quotation marks are slanted (”)? Changing them to regular quotes (") should fix those errors.
HTH
08-23-2015, 07:07 PM | #3 | ramsforums (Original Poster) | Member | Registered: Jun 2007 | Posts: 66
Quote:
Originally Posted by norobro
Did you mean <stdio.h> ?
You also need to include two additional headers (from man opendir or man readdir):
Code:
#include <sys/types.h>
#include <dirent.h>
The "error: stray ..." errors are caused by unicode characters copied from your browser. Notice the quotation marks are slanted (”)? Changing them to regular quotes (") should fix those errors.
HTH
Thanks norobro. I was able to compile it. I modified the code as follows. When I run the application it executes, but it does not delete the files or the directory. What could be wrong?
Code:
rama@develop ~ $ cat mydel.c
//COMPILE WITH: g++ -o mydelexe mydel.c
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
int main(int argc, char *argv[])
{
struct dirent *d;
DIR *dir;
char buf[256];
int i;
if (argc < 1)
{
printf("*** Usage: mydelexe [folder] *** \n\r");
return(1);
}
printf("***mydelexe %s ***\n\r", argv[1]);
dir = opendir(argv[1]);
while( d = readdir(dir) ) {
i++;
if(i%100==0) {
printf("%s %i deleted!\n\r",d->d_name,i);
}
sprintf(buf, "%s/%s \n\r", argv[1], d->d_name);
remove(buf);
}
return 0;
}
08-23-2015, 08:48 PM | #4 | rknichols | Senior Member | Registered: Aug 2009 | Distribution: Rocky Linux | Posts: 4,789
You call a function like remove() without checking the return value and errno, and then ask someone to tell you why it didn't work. Really?
Code:
if(remove(buf) < 0) perror(buf);
You'll need "#include <errno.h>" to let that compile.
For that matter, the return values from opendir() and readdir() should also be checked.
Last edited by rknichols; 08-23-2015 at 08:49 PM.
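A version of the OP's program with that checking folded in might look like the sketch below (untested, and only an illustration: with perror() in place, it would report that the OP's sprintf() appends " \n\r" to each path, so remove() is being handed a filename that does not exist).
Code:
/* COMPILE WITH: gcc -o mydelexe mydel.c (g++ works too) */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
	struct dirent *d;
	DIR *dir;
	char buf[4096];
	long i = 0;			/* initialize the counter */

	if (argc < 2) {			/* argv[0] is the program name itself */
		fprintf(stderr, "Usage: %s <folder>\n", argv[0]);
		return 1;
	}
	dir = opendir(argv[1]);
	if (dir == NULL) {		/* check opendir(), as suggested above */
		perror(argv[1]);
		return 1;
	}
	while ((d = readdir(dir)) != NULL) {
		if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
			continue;	/* never try to remove . or .. */
		snprintf(buf, sizeof(buf), "%s/%s", argv[1], d->d_name);
		if (remove(buf) < 0)	/* report failures instead of guessing */
			perror(buf);
		if (++i % 100 == 0)
			printf("%s %ld deleted\n", d->d_name, i);
	}
	closedir(dir);
	return 0;
}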
08-23-2015, 10:09 PM | #5 | jpollard | Senior Member | Registered: Dec 2012 | Location: Washington DC area | Distribution: Fedora, CentOS, Slackware | Posts: 4,912
The problem is the directory with a huge number of files.
The easiest way to deal with that is to let a "/usr/bin/rm -rf <directorypath>" run as long as it takes. (I suggest using a virtual terminal for this; you could do a "nohup /usr/bin/rm -rf <directorypath> >/tmp/nohup.out 2>&1 </dev/null &".)
The problem is the way rm works and how it interacts with certain filesystems. Filesystems that use a btree structure for the directory are fastest, but most just use a linear list.
When a file gets deleted, the kernel has to copy the rest of the entries up one place in the directory file, then repeat for the next file. rm starts with the very first file - thus, the worst-case delete.
You CAN make it run faster, but it depends on reading the directory list into memory, then reversing the order and deleting each file. This is much faster because you avoid the kernel having to copy the remaining list of files.
Note: doing this with a million files takes a fairly large amount of memory. When I did it, I used perl. As I recall it went something like:
Code:
$some_dir = "directory with lots of files"; # this can be "." if you first do a cd to the directory...
opendir(my $dh, $some_dir) || die;
@list = reverse readdir($dh);                   # slurp every name, then reverse the order
closedir($dh);
while (defined ($f = shift(@list))) {
    next if (-d $f);                            # skip directories, including . and ..
    unlink($f) or die "can't delete $f - $!\n"; # note: -d and unlink resolve $f against the cwd
}
Note - I have not tested this, it is from memory. How well it works depends on the filesystem. I'm not sure that it will help on btrfs (it uses a btree for the directory, so it should be fast anyway).
This particular bit of code will not delete directories (so . and .. will be left alone).
Last edited by jpollard; 08-23-2015 at 10:12 PM.
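For anyone who would rather stay in C, a rough equivalent of that reverse-order idea might look like this (an untested sketch with the same caveats: the whole name list is held in memory, and subdirectories are skipped):
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	const char *path = (argc > 1) ? argv[1] : ".";
	DIR *dir = opendir(path);
	struct dirent *d;
	char **names = NULL;
	size_t n = 0, cap = 0;

	if (!dir) { perror(path); return 1; }
	/* Pass 1: read every name into memory. */
	while ((d = readdir(dir)) != NULL) {
		if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
			continue;
		if (n == cap) {				/* grow the list as needed */
			cap = cap ? cap * 2 : 1024;
			names = realloc(names, cap * sizeof *names);
			if (!names) { perror("realloc"); return 1; }
		}
		names[n++] = strdup(d->d_name);
	}
	closedir(dir);

	/* Pass 2: delete in reverse order, skipping subdirectories. */
	for (size_t i = n; i-- > 0; ) {
		char buf[4096];
		struct stat st;
		snprintf(buf, sizeof(buf), "%s/%s", path, names[i]);
		if (lstat(buf, &st) == 0 && !S_ISDIR(st.st_mode)) {
			if (unlink(buf) < 0)
				perror(buf);
		}
		free(names[i]);
	}
	free(names);
	return 0;
}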
08-23-2015, 10:35 PM | #6 | rknichols | Senior Member | Registered: Aug 2009 | Distribution: Rocky Linux | Posts: 4,789
Quote:
Originally Posted by jpollard
When a file gets deleted, the kernel has to copy the rest of the entries up one place in the directory file, then repeat for the next file. rm starts with the very first file - thus, the worst-case delete.
I don't know of any filesystem that works that way. Can you give an example of one? For all the filesystems I know with simple, unordered linear mapping, removing an entry leaves a hole and no moving of other entries is done. If you are scanning through the directory with readdir(), it would appear that the directory had been re-packed, but that's just because readdir() doesn't show you the empty spaces.
Really, starting with the last file would be the worst approach because when given a name, the kernel does have to perform a linear search from the beginning of the directory. If it's the last file of millions, that's going to take a while. Now, for a really primitive directory like the FAT variants (prior to EXFAT, at least), the unused entries have to be skipped over one-by-one, but for ext2/3/4 consecutive free areas are coalesced into a single block, so the skip just requires a single seek().
08-23-2015, 11:19 PM | #7 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,237
The rsync trick has worked for me - not with multi-millions, tho'.
Reading the entire list into memory is why ls and rm take so long; this seems the best approach.
08-24-2015, 12:40 AM | #8 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,237
Quote:
Originally Posted by ramsforums
My laptop was reporting low storage space, so I installed BleachBit and ran it with root privileges with all options enabled, including wipe. It ran overnight for 12 hours and the application became unresponsive, so I held the power button for 10 seconds to power the laptop down.
This is never a good idea - especially with a program that is potentially bypassing the filesystem and writing directly to disk.
Is the filesystem clean? Will fsck run - and will it finish? How do you know you don't have recursive entries?
All the solutions offered presume a valid filesystem that just happens to have a boatload of file/directory entries. If the filesystem is innately broken, the only solution might be mkfs. And that of course means you have to somehow get your good data off first.
I suppose expecting a good, recent backup that you could simply restore from would be unrealistic?
1 member found this post helpful.
08-24-2015, 05:55 AM | #9 | jpollard | Senior Member | Registered: Dec 2012 | Location: Washington DC area | Distribution: Fedora, CentOS, Slackware | Posts: 4,912
Quote:
Originally Posted by rknichols
I don't know of any filesystem that works that way. Can you give an example of one? For all the filesystems I know with simple, unordered linear mapping, removing an entry leaves a hole and no moving of other entries is done. If you are scanning through the directory with readdir(), it would appear that the directory had been re-packed, but that's just because readdir() doesn't show you the empty spaces.
Really, starting with the last file would be the worst approach because when given a name, the kernel does have to perform a linear search from the beginning of the directory. If it's the last file of millions, that's going to take a while. Now, for a really primitive directory like the FAT variants (prior to EXFAT, at least), the unused entries have to be skipped over one-by-one, but for ext2/3/4 consecutive free areas are coalesced into a single block, so the skip just requires a single seek().
Nearly all of them. The problem is that a directory entry is always variable length. Leaving holes in a directory causes searching problems in that you are constantly seeking the next entry. You have to search to find a hole big enough for a new entry (plus the garbage handling for splitting/merging).
A directory search is far faster than trying to handle the garbage involved. This is why btree structures are being used to make it manageable. I did forget about hash-tree directories; those are another case this isn't helpful for, though it shouldn't hurt.
The only thing this method is trying to avoid is the inherent garbage collection for a directory. It cannot avoid the search - that will happen no matter what method is used.
Remember, the garbage is in memory (and on disk, though the disk doesn't get involved as much). The constant rearranging of the directory in memory is what is being avoided. Deleting the last entry causes the least rearranging. A btree list still has balancing actions, so the method is less helpful there. This is why I also gave the simple approach - it will work in all situations, though for some it is very slow. The reverse-order delete takes a bit more setup time (reversing the order) but will work everywhere, and it is most helpful on linear directories.
08-24-2015, 09:59 AM | #10 | rknichols | Senior Member | Registered: Aug 2009 | Distribution: Rocky Linux | Posts: 4,789
Quote:
Originally Posted by jpollard
Nearly all of them.
That is not an example of one.
Quote:
The problem is that a directory entry is always variable length. Leaving holes in a directory causes searching problems in that you are constantly seeking the next entry. You have to search to find a hole big enough for a new entry (plus the garbage handling for splitting/merging).
Nonetheless, that's how it works, at least in ext2/3/4. Here are some dumps of a directory on an ext2 filesystem built without the "dir_index" option.
First, after "touch filename1 longerfilename2 file3"
Code:
00000000 02 00 00 00 0C 00 01 02 2E 00 00 00 02 00 00 00 ................
00000010 0C 00 02 02 2E 2E 00 00 0B 00 00 00 14 00 0A 02 ................
00000020 6C 6F 73 74 2B 66 6F 75 6E 64 00 00 0C 00 00 00 lost+found......
00000030 14 00 09 01 66 69 6C 65 6E 61 6D 65 31 00 00 00 ....filename1...
00000040 0D 00 00 00 18 00 0F 01 6C 6F 6E 67 65 72 66 69 ........longerfi
00000050 6C 65 6E 61 6D 65 32 00 0E 00 00 00 A8 03 05 01 lename2.........
00000060 66 69 6C 65 33 00 00 00 00 00 00 00 00 00 00 00 file3...........
Next, "rm filename1; touch longerfilename4"
Code:
00000000 02 00 00 00 0C 00 01 02 2E 00 00 00 02 00 00 00 ................
00000010 0C 00 02 02 2E 2E 00 00 0B 00 00 00 28 00 0A 02 ............(...
00000020 6C 6F 73 74 2B 66 6F 75 6E 64 00 00 00 00 00 00 lost+found......
00000030 14 00 09 01 66 69 6C 65 6E 61 6D 65 31 00 00 00 ....filename1...
00000040 0D 00 00 00 18 00 0F 01 6C 6F 6E 67 65 72 66 69 ........longerfi
00000050 6C 65 6E 61 6D 65 32 00 0E 00 00 00 10 00 05 01 lename2.........
00000060 66 69 6C 65 33 00 00 00 0C 00 00 00 98 03 0F 01 file3...........
00000070 6C 6F 6E 67 65 72 66 69 6C 65 6E 61 6D 65 34 00 longerfilename4.
Note that all that has happened to the "filename1" entry is that its inode number has been zeroed. The new "longerfilename4" would not fit in that space, so it is added at the end.
Now, "rm longerfilename2; touch longerfilename5"
Code:
00000000 02 00 00 00 0C 00 01 02 2E 00 00 00 02 00 00 00 ................
00000010 0C 00 02 02 2E 2E 00 00 0B 00 00 00 14 00 0A 02 ................
00000020 6C 6F 73 74 2B 66 6F 75 6E 64 00 00 0D 00 00 00 lost+found......
00000030 2C 00 0F 01 6C 6F 6E 67 65 72 66 69 6C 65 6E 61 ,...longerfilena
00000040 6D 65 35 00 18 00 0F 01 6C 6F 6E 67 65 72 66 69 me5.....longerfi
00000050 6C 65 6E 61 6D 65 32 00 0E 00 00 00 10 00 05 01 lename2.........
00000060 66 69 6C 65 33 00 00 00 0C 00 00 00 98 03 0F 01 file3...........
00000070 6C 6F 6E 67 65 72 66 69 6C 65 6E 61 6D 65 34 00 longerfilename4.
Note that the space for the two deleted entries of length 20 (0x14) and 24 (0x18) has been combined into a single entry of length 44 (0x2c). The characters of the "longerfilename2" entry still remain, but they are within that same 44-byte entry and follow the terminal NUL of the "longerfilename5" name, which was able to fit in that combined space. The subsequent entries in the directory remain in exactly the same locations as before.
08-24-2015, 01:54 PM | #11 | jpollard | Senior Member | Registered: Dec 2012 | Location: Washington DC area | Distribution: Fedora, CentOS, Slackware | Posts: 4,912
It is still having to do garbage collection to merge empty spaces, plus the I/O...
Most of what I tried to avoid is the in-memory scrambling that goes on. Starting at the end makes it simple: all that is needed is to merge into one big block.
Last edited by jpollard; 08-24-2015 at 01:56 PM.
08-24-2015, 02:35 PM | #12 | jefro | Moderator | Registered: Mar 2008 | Posts: 22,110
I might boot to a live CD and check the filesystem. (We'd like to know what filesystem it is, by the way.)
Then I'd just delete it from the live boot.
08-24-2015, 02:36 PM | #13 | rknichols | Senior Member | Registered: Aug 2009 | Distribution: Rocky Linux | Posts: 4,789
Starting at the end means that for every name you pass in an unlink() system call the kernel has to start at the beginning of the directory and do a string compare on every one of the millions of entries until it gets to the matching name. If you start at the beginning, then the first name it tests will match. Tell me again which of those is faster.
And I found one place I was wrong earlier. The merging of successive deleted entries is done only when the kernel is trying to find space to add a new entry. If all you are doing is deleting, you get successive deleted entries that have to be skipped over individually. That's a matter of seeing that the inode number is zero and then reading the rec_len field from the next 16 bits -- painful, but still not as bad as doing a string compare on the name field.
As to why things are left in such an inefficient state, I suspect the answer is, "You should be using dir_index."
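To make that concrete, this is (approximately) the on-disk record being walked; the field names follow ext2_dir_entry_2 in the kernel sources, and the hexdumps in post #10 can be read against it (4 bytes of inode, 2 bytes of rec_len, then name_len, file_type, and the name):
Code:
#include <stdint.h>

/* Approximate on-disk layout of an ext2/3/4 directory entry (integers
 * are little-endian on disk). A deleted entry keeps its rec_len, so a
 * scan hops over it with:
 *   next = (struct ext2_dirent *)((char *)cur + cur->rec_len);        */
struct ext2_dirent {
	uint32_t inode;     /* inode number; zeroed when the entry is deleted */
	uint16_t rec_len;   /* bytes from this entry to the start of the next */
	uint8_t  name_len;  /* length of name[]                               */
	uint8_t  file_type; /* 1 = regular file, 2 = directory, ...           */
	char     name[];    /* the name itself, not NUL-terminated on disk    */
};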
08-24-2015, 02:40 PM | #14 | rknichols | Senior Member | Registered: Aug 2009 | Distribution: Rocky Linux | Posts: 4,789
Quote:
Originally Posted by jefro
I might boot to a live CD and check the filesystem. (We'd like to know what filesystem it is, by the way.)
Then I'd just delete it from the live boot.
Sometimes it's best to just save everything you want to keep and then use mkfs to blow away the rest. Of course if you've got terabytes of data that you would have to save and restore, that's not a good solution either.
08-24-2015, 06:21 PM | #15 | Member | Registered: Jul 2005 | Location: Montreal, Canada | Distribution: Fedora 31 and Tumbleweed (Gnome versions) | Posts: 311
Quote:
Originally Posted by rknichols
Starting at the end means that for every name you pass in an unlink() system call the kernel has to start at the beginning of the directory and do a string compare on every one of the millions of entries until it gets to the matching name. If you start at the beginning, then the first name it tests will match. Tell me again which of those is faster.
And I found one place I was wrong earlier. The merging of successive deleted entries is done only when the kernel is trying to find space to add a new entry. If all you are doing is deleting, you get successive deleted entries that have to be skipped over individually. That's a matter of seeing that the inode number is zero and then reading the rec_len field from the next 16 bits -- painful, but still not as bad as doing a string compare on the name field.
As to why things are left in such an inefficient state, I suspect the answer is, "You should be using dir_index."
If the midpoint of the directory is not valid, I would guess that the system search algorithm would creep up to the next recognized entry and continue with some form of binary search (My BS).