Starting in a directory on my server, I need to delete all directories below it which only contain files older than a certain age.
To be clear, I don't just want to delete old files; if a directory has any new files in it then I don't want to delete the older files that might also be in there.
Background: I run dspam on a server, and it creates a directory for each mail address it scans, containing a log and a quarantine mailbox, amongst other things. I have many many directories that were created automatically for non-existent users and haven't seen any changes in months or even years, but "active" directories do have old files in them (preferences files, etc). So I want to find directories that haven't had any of their contents change in (say) 6 months, and then delete the whole directory.
On 07/05/10 10:13, Mark Rogers wrote:
Starting in a directory on my server, I need to delete all directories below it which only contain files older than a certain age.
To be clear, I don't just want to delete old files; if a directory has any new files in it then I don't want to delete the older files that might also be in there.
Surely you can just use find to check each directory's timestamp, as this will have been updated every time a file has been added. So directories that have a 6 month old timestamp haven't had new files added in 6 months.
What this won't cover is old files that have been modified recently; those will get deleted, because changing the contents of an existing file leaves the directory listing (and so the directory timestamp) unchanged.
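To see that limitation in action, here is a small sketch in Python. It assumes a typical Linux filesystem, and the function name directory_mtime_demo is made up for illustration. It backdates a scratch directory, appends to an existing file, then creates a new file, recording the directory's mtime at each step.

```python
import os
import tempfile
import time

YEAR = 365 * 24 * 60 * 60

def directory_mtime_demo():
    """Return the directory mtime before any change, after an append, and after a create."""
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "log")
        with open(log, "w") as f:
            f.write("first line\n")

        # Pretend the directory was last changed a year ago.
        old = time.time() - YEAR
        os.utime(d, (old, old))
        before = os.stat(d).st_mtime

        # Appending to an existing file does not add or remove a directory entry.
        with open(log, "a") as f:
            f.write("appended line\n")
        after_append = os.stat(d).st_mtime

        # Creating a new file adds a directory entry, so the directory mtime moves.
        with open(os.path.join(d, "new"), "w"):
            pass
        after_create = os.stat(d).st_mtime
        return before, after_append, after_create

if __name__ == "__main__":
    before, after_append, after_create = directory_mtime_demo()
    print("append changed directory mtime:", after_append != before)
    print("create changed directory mtime:", after_create != before)
```

So a log file that is appended to every day never refreshes the timestamp of the directory that holds it.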
Mark Rogers wrote:
Starting in a directory on my server, I need to delete all directories below it which only contain files older than a certain age.
To be clear, I don't just want to delete old files; if a directory has any new files in it then I don't want to delete the older files that might also be in there.
Background: I run dspam on a server, and it creates a directory for each mail address it scans, containing a log and a quarantine mailbox, amongst other things. I have many many directories that were created automatically for non-existent users and haven't seen any changes in months or even years, but "active" directories do have old files in them (preferences files, etc). So I want to find directories that haven't had any of their contents change in (say) 6 months, and then delete the whole directory.
A quick and dirty command line:
# create example directory tree.
$ mkdir formark
$ cd formark
$ mkdir -p ./somedir/newonly ./somedir/newandold ./somedir/oldonly
$ touch ./somedir/newonly/newish
$ touch ./somedir/newandold/newish
$ touch ./somedir/newandold/newish2
$ touch --date="1 year ago" ./somedir/newandold/oldish
$ touch --date="1 year ago" ./somedir/newandold/oldish2
$ touch --date="1 year ago" ./somedir/oldonly/oldish3

# find the dirs with new files; we want to keep them
$ find . -mindepth 1 -type f -mtime -180 | xargs -n 1 dirname | sort -u | tee keepers

# get all dirs, filter out the keepers
$ find . -mindepth 1 -type d -exec /bin/sh -c "echo checking {}; if ! grep -q {} keepers; then echo {} >> delme; fi" \;

$ cat delme
./somedir/oldonly

# do the delete
$ xargs --interactive -n 1 rm -r < delme
Or with a python script:
$ cat > delme.py <<EOM
from __future__ import generators
import os, time

# anything last modified before this moment counts as old (roughly 6 months)
old = time.time() - 6*30*24*60*60

def onlyolddir(dir):
    # true if no entry directly inside dir was modified after the cutoff
    for f in os.listdir(dir):
        fullpath = os.path.join(dir, f)
        if os.path.getmtime(fullpath) > old:
            return False
    return True

def dirwalk(dir):
    # yield every directory below dir, without descending into symlinks
    for f in os.listdir(dir):
        fullpath = os.path.join(dir, f)
        if os.path.isdir(fullpath) and not os.path.islink(fullpath):
            yield fullpath
            for x in dirwalk(fullpath):
                yield x

for elem in dirwalk("."):
    if onlyolddir(elem):
        print(elem)
EOM
$ python delme.py > delme
Disclaimer: Untested on complex filenames and directory structures, may set fire to your cat, make backups first, season to taste, etc etc.
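For comparison, here is a rough sketch of the same idea that looks at every file below a directory rather than only its immediate entries. It makes assumptions that are not in the thread: one directory per address directly below the root, a directory with no files at all counts as stale, and the names newest_mtime, stale_dirs and make_demo_tree are invented for this sketch. It only lists candidates and deletes nothing.

```python
import os
import tempfile
import time

DAY = 24 * 60 * 60

def newest_mtime(directory):
    """Return the newest mtime of any file below directory, or None if it holds no files."""
    newest = None
    for dirpath, dirnames, filenames in os.walk(directory):  # symlinked dirs are not followed
        for name in filenames:
            mtime = os.lstat(os.path.join(dirpath, name)).st_mtime
            if newest is None or mtime > newest:
                newest = mtime
    return newest

def stale_dirs(root, max_age_days=180):
    """List immediate subdirectories of root that contain no file newer than the cutoff."""
    cutoff = time.time() - max_age_days * DAY
    stale = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if not os.path.isdir(path) or os.path.islink(path):
            continue
        newest = newest_mtime(path)
        # Assumption: a directory without any files is treated as stale.
        if newest is None or newest < cutoff:
            stale.append(path)
    return stale

def make_demo_tree(root):
    """Recreate a cut-down version of the example tree under root."""
    year_ago = time.time() - 365 * DAY
    for rel, stamp in [("newonly/newish", None), ("newandold/newish", None),
                       ("newandold/oldish", year_ago), ("oldonly/oldish3", year_ago)]:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
        if stamp is not None:
            os.utime(path, (stamp, stamp))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        print([os.path.basename(p) for p in stale_dirs(root)])
```

Because the filenames only ever travel as Python strings passed to system calls, odd characters in them are not interpreted along the way.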
-- Martijn
On 7 May 2010 12:24, Martijn Koster mak-alug@greenhills.co.uk wrote:
Mark Rogers wrote:
Starting in a directory on my server, I need to delete all directories below it which only contain files older than a certain age.
To be clear, I don't just want to delete old files; if a directory has any new files in it then I don't want to delete the older files that might also be in there.
A quick and dirty command line:
# create example directory tree.
$ mkdir formark
$ cd formark
$ mkdir -p ./somedir/newonly ./somedir/newandold ./somedir/oldonly
$ touch ./somedir/newonly/newish
$ touch ./somedir/newandold/newish
$ touch ./somedir/newandold/newish2
$ touch --date="1 year ago" ./somedir/newandold/oldish
$ touch --date="1 year ago" ./somedir/newandold/oldish2
$ touch --date="1 year ago" ./somedir/oldonly/oldish3

# find the dirs with new files; we want to keep them
$ find . -mindepth 1 -type f -mtime -180 | xargs -n 1 dirname | sort -u | tee keepers

# get all dirs, filter out the keepers
$ find . -mindepth 1 -type d -exec /bin/sh -c "echo checking {}; if ! grep -q {} keepers; then echo {} >> delme; fi" \;

$ cat delme
./somedir/oldonly

# do the delete
$ xargs --interactive -n 1 rm -r < delme
Makes me wonder, since it is doing an rm (yes I know it's interactive but too easy to "yeah whatever... i'll press y" if you have too many files), how would one go about unit testing this given that it's a bash script?
Regards, Srdjan
Srdjan Todorovic wrote:
Makes me wonder, ... how would one go about unit testing this given that it's a bash script?
For automated unit testing you would just write a script that populates a test tree (much like I did there), executes the script, then compares the left-over tree (just run a "find") with the expected one (a previously vetted "known good" version).
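As a rough illustration of that populate/run/compare cycle, here is a sketch in Python. Everything in it is hypothetical: build_tree, snapshot and run_case are invented helper names, and example_cleanup is only a stand-in for whatever script is really under test.

```python
import os
import shutil
import tempfile
import time

DAY = 24 * 60 * 60

def build_tree(root, spec):
    """Create empty files from spec, a dict of relative path -> age in days."""
    now = time.time()
    for relpath, age_days in spec.items():
        path = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
        stamp = now - age_days * DAY
        os.utime(path, (stamp, stamp))

def snapshot(root):
    """Return the sorted relative paths of everything left under root (like running find)."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)

def run_case(cleanup, spec, expected):
    """Populate a scratch tree, run cleanup(root), and compare with the vetted listing."""
    with tempfile.TemporaryDirectory() as root:
        build_tree(root, spec)
        cleanup(root)
        return snapshot(root) == sorted(expected)

def example_cleanup(root, max_age_days=180):
    """Hypothetical stand-in for the script under test."""
    cutoff = time.time() - max_age_days * DAY
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if not os.path.isdir(path) or os.path.islink(path):
            continue
        mtimes = [os.lstat(os.path.join(d, f)).st_mtime
                  for d, _, files in os.walk(path) for f in files]
        if all(m < cutoff for m in mtimes):
            shutil.rmtree(path)

if __name__ == "__main__":
    spec = {"newonly/newish": 0, "newandold/newish": 0,
            "newandold/oldish": 365, "oldonly/oldish3": 365}
    expected = ["newonly", "newonly/newish",
                "newandold", "newandold/newish", "newandold/oldish"]
    print("PASS" if run_case(example_cleanup, spec, expected) else "FAIL")
```

Each extra case (odd filenames, symlinks, empty directories) then becomes another spec/expected pair rather than another hand-built tree.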
<tangent> Then expand the test tree to have more complexities: multilevel directories, different old/young timestamps, empty directories, directories and files with names starting with dashes/spaces/dots/asteriskses/backslashes/highbits, symlinks in the tree, symlinks outside the tree, to files and directories etc. Rather than sticking all those cases in a single test tree you might prefer to split the testcases up, to keep the setup/verification stages easier to manage, and run them as a suite from yet another script.
If producers/users are modifying the tree while you're running your script, then you may need to be even more careful, and that's harder to test in an automated fashion; you might need some mock filesystem.
And of course, put it under revision control, add a license, add comments, write documentation, note dependencies (python versions) and tested platforms, have it code reviewed. Etc etc. :-)
But first you really need to figure out what you're actually trying to achieve for this particular use case. Do you want to move these files to a review area rather than just deleting? Do you want to use some filename pattern matching to distinguish dspam logs/mailboxes and treat them differently? Does the depth of the node in the tree have some significant semantics? If you encounter unusual files, do you want to process them, or just abort so that the sysadmin will investigate? Can you perhaps change the producer to deposit its data in a different way (like year/month/day subdirectories) that are easier to dispose of? Is this just a small number of local files, or some massive distributed filesystem? Is this maintenance a one-off, or will this run regularly? Is the data mission critical? Are there audit/retention policies or backup management implications?
You can make this as complex as you choose. I just wanted to give Mark some quick inspiration before lunch :-) </tangent>
-- Martijn
Date: Fri, 7 May 2010 13:09:09 +0100 From: todorovic.s@googlemail.com To: main@lists.alug.org.uk Subject: Re: [ALUG] Deleting directories which only contain old files
how would one go about unit testing this given that it's a bash script?
You can use the sh command to run it. Using the option -v will show each line before it is executed. There are more options; I thought there was a -i which did the same. See "man sh".

# more ./shelltest.sh
#!/bin/sh
echo line 1
echo line 2
echo .
# sh -v ./shelltest.sh
#!/bin/sh
echo line 1
line 1
echo line 2
line 2
echo .
.
#
HTH Keith
On 07/05/10 11:36, Wayne Stallwood wrote:
What this won't cover is old files that have been modified recently, those will get deleted because the directory list will remain unchanged
.. and therein lies the problem (in this case). Very often there will be a file (eg log file) which was created a couple of years ago but has been appended to regularly since then.
On 07/05/10 12:24, Martijn Koster wrote:
# find the dirs with new files; we want to keep them
$ find . -mindepth 1 -type f -mtime -180 | xargs -n 1 dirname | sort -u | tee keepers

# get all dirs, filter out the keepers
$ find . -mindepth 1 -type d -exec /bin/sh -c "echo checking {}; if ! grep -q {} keepers; then echo {} >> delme; fi" \;

$ cat delme
./somedir/oldonly

# do the delete
$ xargs --interactive -n 1 rm -r < delme
Perfect, thanks - this did the trick. The only problem I had turned out to be a single directory whose name started with |, but that was easy enough to deal with manually. However, I'd be interested to know how it could be generalised to avoid that being a problem (mainly from a security point of view; not handling the directory was fine, but I'm sure someone could work out an exploit and I'm just curious as to how to avoid it in the general case).
I modified the "rm" to a "mv" so the directories are still there for the time being if I need them, but "du" on the temp directory that now holds them all tells me that I'll get about 400MB of disk space back on my server when I "rm" it, which is very welcome. So thanks again!
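The move-instead-of-delete step could be sketched in Python along these lines. This is not the command that was actually run; quarantine is a made-up name, and it assumes the holding directory lives outside the tree being cleaned.

```python
import os
import shutil
import tempfile

def quarantine(paths, root, holding):
    """Move each directory in paths (all below root) into holding, keeping its relative path."""
    moved = []
    for path in paths:
        rel = os.path.relpath(path, root)
        if rel == os.curdir or rel.split(os.sep)[0] == os.pardir:
            raise ValueError("refusing to move %r: not below %r" % (path, root))
        target = os.path.join(holding, rel)
        # Recreate the parent directories so equal basenames cannot collide.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(path, target)
        moved.append(target)
    return moved

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as holding:
        # Awkward names are harmless here: they are only ever passed as strings.
        victims = [os.path.join(root, "somedir", name)
                   for name in ("|bardir", "single'quote")]
        for victim in victims:
            os.makedirs(victim)
        for target in quarantine(victims, root, holding):
            print("moved to", target)
```

Running "du" on the holding directory afterwards then gives the space that a final delete would free.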
Mark Rogers wrote:
The only problem I had turned out to be a single directory whose name started with |, but that was easy enough to deal with manually. However, I'd be interested to know how it could be generalised to avoid that being a problem (mainly from a security point of view; not handling the directory was fine, but I'm sure someone could work out an exploit and I'm just curious as to how to avoid it in the general case).
You need to be careful about what interprets the string, and how; in this case the shell interpreted it, and treated it as a pipe symbol. To avoid that in general you need to be careful about using appropriate quoting/escaping, and it's very easy to get wrong, especially when you start passing strings between program invocations in shell scripts, read from files containing filenames etc.
For example, here I've added single quotes around the {}:
$ mkdir formark2
$ cd formark2
$ mkdir '|bardir'
$ touch --date="1 year ago" '|bardir/old'
$ find . -mindepth 1 -type d -exec /bin/sh -c "echo checking '{}'; echo '{}' >> delme" \;
$ cat delme
./|bardir
$ xargs --interactive -n 1 rm -r < delme
rm -r ./|bardir ?...yes
What happens here is that the double quotes are interpreted by the shell you're using to invoke find. So the find program will be invoked with 10 arguments, with the last-but-one being the text between the double quotes. It then replaces the {} occurrences with the current filename, so you end up with:
echo checking './|bardir'; echo './|bardir' >> delme
and it then invokes /bin/sh passing that as an argument. Because the pipe symbol is protected by those single quotes, it is not interpreted as a pipe command. Great! Except... this doesn't work if you have a file with a single quote:
$ mkdir "./single'quote"
$ find . -mindepth 1 -type d -exec /bin/sh -c "echo checking '{}'" \;
/bin/sh: Syntax error: Unterminated quoted string
because now you end up with:
echo checking './single'quote'
which is not legal syntax. You can work around that by using double quotes, but then you have a problem with filenames containing double quotes or dollar signs. Really you want to escape all meta characters, but then it rapidly becomes complex as you pass this through multiple levels of interpretation.
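If a string really must go through a shell, a library routine that escapes every metacharacter is safer than hand-placed quotes. Here is a small sketch using Python's standard shlex.quote, which targets POSIX shells; echo_via_shell is an invented helper, and /bin/sh is assumed to exist.

```python
import shlex
import subprocess

def echo_via_shell(name):
    """Print name through /bin/sh, quoted so the shell sees it as one literal word."""
    command = "printf '%s' " + shlex.quote(name)
    result = subprocess.run(["/bin/sh", "-c", command],
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode()

if __name__ == "__main__":
    for name in ["./|bardir", "./single'quote", './double"quote', "./$HOME"]:
        # Show the quoted form next to what the shell actually printed.
        print(shlex.quote(name), "->", echo_via_shell(name))
```

The quoted form of the single-quote case closes the quote, adds a double-quoted apostrophe, and reopens it, which is exactly the kind of detail that is easy to get wrong by hand.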
You're much better off avoiding these problems altogether: don't invoke shell commands with filenames or other unsafe input. For example:
$ find . -mindepth 1 -type d -print0 | xargs -0 -n 1 echo
./single'quote
./|bardir
What happens here is that your filename is written to stdout (terminated with a null character), read by xargs in the same way, and then passed as an argument to echo during an exec system call; it never gets near a shell.
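The same no-shell route can be sketched from a script. This assumes find and echo are on the PATH; list_dirs and echo_each are names invented here. The argument lists are handed straight to exec, so no shell ever parses the filenames.

```python
import os
import subprocess
import tempfile

def list_dirs(root):
    """Run find without a shell and split its NUL-terminated output into names."""
    out = subprocess.run(
        ["find", root, "-mindepth", "1", "-type", "d", "-print0"],
        stdout=subprocess.PIPE, check=True).stdout
    return [os.fsdecode(raw) for raw in out.split(b"\0") if raw]

def echo_each(names):
    """Pass each name as one argv element to echo, much as xargs -0 -n 1 would."""
    lines = []
    for name in names:
        result = subprocess.run(["echo", name], stdout=subprocess.PIPE, check=True)
        lines.append(os.fsdecode(result.stdout).rstrip("\n"))
    return lines

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name in ("|bardir", "single'quote"):
            os.mkdir(os.path.join(root, name))
        for line in echo_each(list_dirs(root)):
            print(line)
```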
And if you find yourself doing non-trivial logic (like in your example), use a scripting language where you just pass the string around different parts of the program in a variable, and eventually to some system call. Mind you, scripting languages have their own string meta characters and interpolation behaviour with associated quoting/escaping mechanisms, so you still need to be careful.
-- Martijn
On 10/05/10 13:49, Martijn Koster wrote:
You're much better off avoiding these problems altogether: don't invoke shell commands with filenames or other unsafe input. For example:
$ find . -mindepth 1 -type d -print0 | xargs -0 -n 1 echo
./single'quote
./|bardir
What happens here is that your filename is written to stdout (terminated with a null character), read by xargs in the same way, and then passed as an argument to echo during an exec system call; it never gets near a shell.
Ah, that would be what I was after. In fact I already knew the answer had I dug deep enough into the old brain memory!
"sort", "uniq" and "grep" all support null character delimited input/output so I guess that the script you gave me to start with could be reworked to avoid problems, although I haven't tested it.
And if you find yourself doing non-trivial logic (like in your example), use a scripting language where you just pass the string around different parts of the program in a variable, and eventually to some system call. Mind you, scripting languages have their own string meta characters and interpolation behaviour with associated quoting/escaping mechanisms, so you still need to be careful.
That did cross my mind; at least it would mean picking an environment in which I'm more familiar so I'd see the pitfalls earlier and know how to solve them. But on the other hand, I prefer the bash option as it's something I *should* know much better than I do.