How To Delete Many and Large Files in a Multitude of Directories

A couple of days ago I scrambled together some notes on Backup and Synchronisation and spoke of the need to exclude certain directories (e.g., .gvfs, .wine) from synchronisation. Well, it turns out I failed to eat my own dogfood. Backing up a desktop machine to a central server, I discovered that I had not excluded a .wine directory; it hadn't been used for a long time, I had uninstalled the program and, frankly, I had forgotten about it. Well, a few hundred gigabytes later...

The problem was that the damn thing had recursed into itself (and this is what the bloody things do); .wine included a reference back to my home directory, which included a reference to .wine (and .gvfs, by the way), and so on. The end result was a directory tree that was probably at least a dozen levels deep, if not more.

So, having received a waggly finger from my senior systems administrator, I went to the command line and typed rm -rf .wine in the appropriate directory... and waited. This was apparently going to take quite a while. It is well known that ext3 is a particularly bad filesystem for such mass deletions, but that's small comfort for those who find themselves in such a situation.

Now one option is to run it in the background (e.g., rm -rf .wine &) if you don't really care about system resources being hogged. Or you can use something like ionice (e.g., ionice -c2 -n7 rm -rf .wine); the bottleneck will be the I/O, not the CPU scheduler, and in fact using nice rather than ionice may make the problem worse.

Fortunately there are a number of ways that the deletion can be done quickly and without hammering one's machine.

The main issue is that rm -rf [directory] traverses the filesystem, reading inode data, carrying out its job and moving on to the next file. Part of this involves calling opendir(), readdir() and unlink(); the speed of the first two depends on the number of entries in the directory, the speed of the last on the size of the file, since every block it occupies must be freed. If you have a lot of files and directories, and many of them large, this is going to take a long time.

While discussing directory sizes, it's worth noting that running something like rm -rf * in the directory you want to delete isn't always the best option either. The shell expands the wildcard to every file in the directory, and if the expansion exceeds the kernel's limit on argument-list size, rm never even runs: /bin/rm: Argument list too long.
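
You can see the limit the shell is up against with getconf; ARG_MAX is the ceiling (in bytes) on the combined size of a new process's argument list and environment, and the exact figure varies by system:

```shell
# ARG_MAX is the kernel's limit on the size of a new process's
# argument list plus environment, in bytes.  A wildcard that
# expands past it produces "Argument list too long".
getconf ARG_MAX
```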

One method suggested is to run rm in parallel. A single rm, such as the one above, is problematic as it works through every file in the directory one at a time - but one can run several (or even a dozen or more) at once. Here is one such solution, which pipes ls into a short while loop, keeping up to ten jobs running in parallel - although it will have problems with files with newlines and other unusual characters in their names, since it uses the output of the ls process as a data input. For the purpose of deleting ~/backup/.wine/, it worked like a dream.

cd ~/backup/.wine
ls | { while read X; do
    while [ "$(jobs | wc -l)" -gt 10 ]; do sleep 1; done
    rm -R "$X" &
done
wait; }
cd ..
rmdir .wine
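
For what it's worth, GNU xargs can do the same throttling itself via its -P flag, which caps the number of concurrent processes without the jobs/wc polling loop. A sketch against a throwaway directory (demo_parallel is a made-up name, not the .wine tree):

```shell
# Build a disposable directory, then delete its files with up to
# four rm processes running concurrently (-P4); -I{} means one
# filename per rm invocation.
mkdir -p demo_parallel
touch demo_parallel/a demo_parallel/b demo_parallel/c demo_parallel/d
ls demo_parallel | xargs -P4 -I{} rm -f demo_parallel/{}
ls -A demo_parallel | wc -l   # prints 0 once all removals finish
rmdir demo_parallel
```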

Another proposed solution is deceptively simple: find . -type f -exec rm -rfv {} \;. One problem with this is that it uses the verbose flag on rm, which slows things down, but more importantly, every time find matches a file it forks a child process to run rm. An alternative is to pipe the output of find through xargs, which builds command lines from standard input, packing many files into each invocation. This avoids a range of issues: the per-file forking of the -exec option, the traversal issues caused by globbing, and the fragility of parsing a directory listing as an input.
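
Note that find itself offers a middle ground: terminating -exec with + instead of \; makes find batch arguments xargs-style, forking once per batch rather than once per file. A small sketch (demo_exec is a hypothetical directory created just for the demo):

```shell
# -exec ... {} + appends as many pathnames as fit into each rm
# invocation, so find forks once per batch instead of once per file.
mkdir -p demo_exec
touch demo_exec/a demo_exec/b demo_exec/c
find demo_exec -type f -exec rm -f {} +
ls -A demo_exec | wc -l   # prints 0
rmdir demo_exec
```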

Thus for something that works fast and is easy, try the following:

cd ~/backup/.wine
find . -mindepth 1 -print0 | xargs -0 rm -rf
cd ..
rmdir .wine
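
As a sanity check that the find-into-xargs approach copes with awkward filenames when GNU's NUL separators (-print0 / -0) are used, here's a self-contained run against a throwaway tree (demo_tree is a made-up name for the demo):

```shell
# -print0 and -0 use NUL as the separator, so spaces in filenames
# survive the trip from find to rm; -mindepth 1 keeps find from
# handing the top-level directory itself to rm.
mkdir -p demo_tree/sub
touch "demo_tree/file with spaces.txt" demo_tree/sub/a
find demo_tree -mindepth 1 -print0 | xargs -0 rm -rf
rmdir demo_tree && echo deleted
```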

Further Reading

GNU Find, Deleting Files

How To Remove Backups

How To Delete A Million Files.