Counting Frequencies of Frequencies

Lots of people forget how useful the core utilities (the standard tools available at the shell) are. I'm guilty of it myself at times, reaching for quick and easy options like Perl, Ruby, or Python one-liners to process things from the command line. However, those load up an entire interpreter; it is usually better to just use the coreutils.

I'll give you the specific example I had to deal with, but it can be extrapolated and, I'm sure, reused for other purposes. I have lists of the email addresses that received mailings sent over the past 7 days. I gather them with a find command, since the address files are created nightly. The files are laid out with one email address (and some other metadata) per line. You'll notice the awk in the find command; that's just to extract the email address from each line to make things easier to work with.
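To give a feel for the layout, here is a hypothetical sketch of a master_list.txt (the IDs and extra fields are made up for illustration; only the second field, the address, matters here):

84512 alice@example.com campaign_a
84512 bob@example.com campaign_a
84513 carol@example.com campaign_b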

The overall goal here is to find out how many times people were emailed over the past 7 days. Another way of saying it: how many people received 1, 2, 3, 4, etc. mailings over the past week? This is an exercise in aggregation.

First, here is everything in its entirety, and then I will go through it piece by piece.

$ find /nfs/mailings/lists/ -maxdepth 3 -ctime -7 -name "master_list.txt" \
-exec awk '{print $2}' '{}' \; >> /tmp/emails.txt && cat /tmp/emails.txt | \
awk '{ for (i=1;i<=NF;i++) count[$i]++ }  END { for (i in count) print count[i], i }' | \
awk '{print $1}' | \
awk '{ for (i=1;i<=NF;i++) count[$i]++ }  END { for (i in count) print count[i], i }' | \
awk '{print $2,$1}' | \
sort -n && rm -f /tmp/emails.txt
1 93278
2 21415
3 16924
4 3064131
5 548421
6 102

A quick breakdown goes as follows.

1) Find all mailing list master files (-name) that aren't archived (-maxdepth) and that were created in the last 7 days (-ctime), and print their email addresses (-exec awk) to a tmp file. Everything here needs to be appended (>>) so that each list's addresses don't clobber the ones already collected.

find /nfs/mailings/lists/ -maxdepth 3 -ctime -7 -name "master_list.txt" \
-exec awk '{print $2}' '{}' \; >> /tmp/emails.txt
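After this step, /tmp/emails.txt is just a flat list of addresses, one per line, with a repeat for every mailing a person received. With the hypothetical data from before, it might look like:

alice@example.com
bob@example.com
alice@example.com
carol@example.com
alice@example.com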

2) The next piece does a word-frequency count of the emails and prints the result to STDOUT as count address pairs. That is, the first column is how many times the address in the second column received an email over the past 7 days. Now we know how many emails each individual has received, and we need to aggregate again.

cat /tmp/emails.txt | awk '{ for (i=1;i<=NF;i++) count[$i]++ }  \
END { for (i in count) print count[i], i }'
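As an aside, since the file is one address per line, the classic coreutils idiom gives the same count address pairs (modulo whitespace and line ordering):

sort /tmp/emails.txt | uniq -c

uniq -c prefixes each distinct line with the number of times it occurred, which is exactly the count[i], i pair the awk builds in its array.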

3) Because I am working in a pipeline, instead of keeping the previous step's full output, I print only the first column (the count) and then count the frequency of those counts.

awk '{print $1}' | awk '{ for (i=1;i<=NF;i++) count[$i]++ } \
END { for (i in count) print count[i], i }'
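To make the hand-off concrete, with the hypothetical data from above, step 2 would emit something like:

3 alice@example.com
1 bob@example.com
1 carol@example.com

The first awk strips that down to 3, 1, 1, and the second awk rolls those up into:

1 3
2 1

that is, one person received 3 mailings and two people received 1.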

4) Then I swap the columns so the number of mailings comes first and the number of recipients second. This is much more logical and is the way we generally read tables.

awk '{print $2,$1}'
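Continuing the hypothetical numbers, 1 3 becomes 3 1 and 2 1 becomes 1 2: the number of mailings now leads, followed by how many people received that many.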

5) Finally, I sort that result numerically ascending and remove the temp file so I don't append to it again on the next run.

sort -n && rm -f /tmp/emails.txt
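The -n matters here: a plain sort orders the counts as strings, which would put 10 before 2. A quick sketch:

$ printf '1\n10\n2\n' | sort
1
10
2
$ printf '1\n10\n2\n' | sort -n
1
2
10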

6) We get our output in an easy-to-read format (note: I added the headers).

Freq   Total
 1     93278
 2     21415
 3     16924
 4     3064131
 5     548421
 6     102
