Price of Commercials

The price of commercials is especially high for engineers. And by commercials, I don’t mean the intermissions between segments of a sitcom or drama; I mean the brief 15-second interruption when someone asks an engineer in the zone a question that takes 3 seconds to answer. For the sake of argument, let’s say an engineer gets interrupted a mere 5 times per day, including lunch and a daily meeting (let’s call it a scrum, for fun).

If it takes that engineer (or admin, or developer) 10 minutes to get focused after each interruption, plus the initial settling in after arriving at the office, then out of an 8-hour day, a full hour is wasted just refocusing. And refocusing only puts you back on the issue; it doesn’t put you back in the zone. Some engineers only get into the zone once per day. At that rate, a 15-second interruption can massively waste someone’s productivity.

What’s my point? Good question. That commercial/question/interruption someone is pushing onto an engineer could be the straw that breaks the camel’s back on a deadline. So be aware of the situation your people are in: who is talking to them, who has access to them, and who takes advantage of that access. Those precious periods of concentration can hand you a huge win or bring about a big loss.

Printing New Lines in Bash

Ran across this the other day and decided it required sharing. If you want to print a newline ‘\n’ in an echo statement in Bash, one would think it’s as simple as:

beacon:~ elubow$ echo "This is a test\n"
This is a test\n

The problem is that this doesn’t interpolate the newline character. (For more information on interpolation, see the Wikipedia entry on string interpolation.) In order to have the newline interpolated, you need to add the command line switch ‘-e’.

beacon:~ elubow$ echo -e "This is a test\n"
This is a test

This will force Bash’s echo to interpolate any escaped characters in the quotes. Note: unlike in Perl, single or double quotes don’t matter here when deciding whether or not to interpolate the newline characters.
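Both halves of that note are easy to verify: a single-quoted string behaves identically once ‘-e’ is in play, and printf (which interprets backslash escapes without any switch) is a more portable alternative:

beacon:~ elubow$ echo -e 'This is a test\n'
This is a test

beacon:~ elubow$ printf "This is a test\n"
This is a test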

Apple Steps Up With Multitouch Mouse

A while ago I wrote about how Apple needed an external multi-touch solution and how you could use your iPhone until one existed. Apple has now done it and released the Magic Mouse.

To quote Apple, “It’s the world’s first multi-touch mouse.” It’s a wireless mouse that connects over Bluetooth to any computer that supports it. It’s sleek, just like everything else Apple makes. But the best part is that (as of now) it’s only $69. Good work, Apple.

I could go on and on about why I think it’s cool and what it can do, but why waste time reading a summary on my website? Just check it out on Apple’s site: http://www.apple.com/magicmouse/

First Experience With Cassandra

I recently posted about my initial experience with Tokyo Cabinet. Now it’s time to get to work on Cassandra.

Cassandra is a database that Facebook built and runs in production to handle inbox search for its messaging system; it’s also in production use at Digg.

One thing I would like to note: when I tested TC, I used the Perl API for both TC and TT. For Cassandra, I tried both the Perl API and the Ruby API. I couldn’t get the Ruby API (written by Evan Weaver of Twitter) to work at all with the Cassandra server (although I am sure the server included with the gem works well). I initially struggled quite a bit with the UUID aspects of the Perl API, until I finally gave up and changed the ColumnFamily CompareWith type from

<ColumnFamily CompareWith="TimeUUIDType" Name="Users" />

to

<ColumnFamily CompareWith="BytesType" Name="Users" />

Then everything was working well and I began to write my tests. The layout I ended up using works in a schemaless fashion. I created 2 consistent columns per user: email and person_id. Here is where it gets interesting, and different for those of us used to RDBMSes and fewer columns. For this particular project, every time a user is sent an email, a “row” is added (I call it a row for those unfamiliar with Cassandra terminology; it is actually a column) with a name in the format send_dates_<date>. The value of this column is the id of the mailing campaign sent to the user on that date. This means that if a user receives 365 emails per year, one per day, there will be 365 rows (really Cassandra columns) that start with send_dates_ and end with YYYY-MM-DD. Note the data structure below, in a JSON-ish format.

Users = {
    'foo@example.com': {
        email: 'foo@example.com',
        person_id: '123456',
        send_dates_2009-09-30: '2245',
        send_dates_2009-10-01: '2247',
    },
    'bar@baz.com': {
        email: 'bar@baz.com',
        person_id: '789',
        send_dates_2009-09-30: '2245',
        send_dates_2009-10-01: '2246',
    },
}
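To make that layout concrete, here is a minimal sketch of how one user and a send-date column get written, assuming the Thrift-generated Python bindings of the 0.4 era. The keyspace name Mailings matches the data directory shown later; the connection boilerplate and helper name are illustrative, not lifted from my actual load script.

import time

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import ColumnPath, ConsistencyLevel

# Standard Thrift connection boilerplate; Cassandra's Thrift port is 9160.
socket = TSocket.TSocket('localhost', 9160)
transport = TTransport.TBufferedTransport(socket)
transport.open()
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))

def insert_column(key, column, value):
    # Timestamps are supplied by the client in this version of the API.
    client.insert('Mailings', key,
                  ColumnPath(column_family='Users', column=column),
                  value, int(time.time() * 1000000),
                  ConsistencyLevel.ONE)

# The two consistent columns per user, plus one column per send date.
insert_column('foo@example.com', 'email', 'foo@example.com')
insert_column('foo@example.com', 'person_id', '123456')
insert_column('foo@example.com', 'send_dates_2009-09-30', '2245')

transport.close()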

To understand all the data structures in Cassandra better, I strongly recommend reading WTF Is A SuperColumn: Cassandra Data Model and Up And Running With Cassandra. They are written by folks at Digg and Twitter respectively, and both are well worth reading.

So for my first iteration, I simply loaded the data in the format described above. Every insert also writes the email and person_id columns, just in case they aren’t there to begin with. The initial data set has approximately 3.6 million records, which caused all sorts of problems with the default configuration (i.e., Cassandra kept crashing on me). The changes I made to the defaults are as follows (see the sketch after the list):

  • Changed the maximum file descriptors from 1024 (the system default) to 65535 (or unlimited)
  • Changed the default minimum and maximum Java heap sizes to -Xms256M -Xmx2G (I could not get the data to load past 2.5 million records without raising the maximum)
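Roughly, those two changes look like this; the limits.conf path and the startup script name come from a stock Linux install and Cassandra’s distribution, so treat them as assumptions rather than exact instructions:

# Raise the open-file limit in the shell that starts Cassandra
# (or set nofile in /etc/security/limits.conf to make it permanent).
ulimit -n 65535

# In the JVM options of the startup script (cassandra.in.sh),
# raise the heap bounds:
JVM_OPTS="$JVM_OPTS -Xms256M -Xmx2G"

With those settings in place, the initial load looked like this:
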
[elubow@db5 db]$ time ./cas_load.pl -D 2009-09-30 -c queue-mail.ini -b lists/
usa: 99,272
top: 3,661,491
Total: 3,760,763

real    72m50.826s
user    29m57.954s
sys     2m18.816s
[elubow@db5 cassandra]# du -sh data/Mailings/ # Prior to data compaction
13G     data/Mailings/
[elubow@db5 cassandra]# du -sh data/Mailings/ # Post data compaction
1.4G    data/Mailings/

It was interesting to note that the write latency across about 3.6 million records was 0.004 ms. Also, data compaction brought the size of the records on disk down from 13G to 1.4G. Those figures were achieved with the reads and writes happening on the same machine.

The load of the second data set took a mere 30m, compared to closer to 180m for loading the same data set into Tokyo Cabinet.

luxe: 936,911
amex: 599,981
mex: 39,700
Total: 1,576,592

real    30m53.109s
user    12m53.507s
sys     0m59.363s
[elubow@db5 cassandra]# du -sh data/Mailings/
2.4G    data/Mailings/

Now that there is a dataset worth working with, it’s time to start the read tests.

For the first test, I am doing a simple get of the email column for every row, just to iterate over the whole column family and find out the approximate speed of the read operations.
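Here is a sketch of the lookup at the heart of that loop, reusing the client from the earlier sketch. The original test script was Perl; this Python rendering and the helper name are just for illustration.

from cassandra.ttypes import ColumnPath, ConsistencyLevel

def get_email(client, key):
    # One read per row key; the test simply loops this over every key.
    result = client.get('Mailings', key,
                        ColumnPath(column_family='Users', column='email'),
                        ConsistencyLevel.ONE)
    return result.column.value

The run times: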

Run 1: 134m59.923s
Run 2: 125m55.673s
Run 3: 127m21.342s
Run 4: 119m2.414s

For the second test, I made use of a Cassandra feature called get_slice. Since I have columns in the format send_dates_YYYY-MM-DD, I used get_slice to grab, for each row (each email address), all column names between send_dates_2009-09-29 and send_dates_2009-10-29. The maximum number of matches is 2 (since I only loaded 2 days worth of data into the database), and the range covers a 3rd day so the results can come back as 0, 1, or 2.

This first run is using the Perl version of the script.

Email Count: 3557584
Match 0: 4,247
Match 1: 1,993,273
Match 2: 1,560,064

real    177m23.000s
user    45m21.859s
sys     9m17.516s

Run 2: 151m27.042s

On subsequent runs I began running into API issues, so I rewrote the same script in Python to see whether the better-tested Thrift Python API was faster than the Thrift Perl API (and wouldn’t give me timeout issues). The Perl timeout issues ended up being fixable, but I continued the tests in Python.
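The core of the Python version looks roughly like this; again a sketch, with the 0.4-era Thrift signatures from memory and a helper name that is mine rather than anything from cas_get_slice.py:

from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
                              SlicePredicate, SliceRange)

def count_send_dates(client, key):
    # Pull every column whose name falls inside the date range; with
    # only two days of data loaded this returns 0, 1, or 2 columns.
    predicate = SlicePredicate(slice_range=SliceRange(
        start='send_dates_2009-09-29',
        finish='send_dates_2009-10-29',
        reversed=False,
        count=100))
    columns = client.get_slice('Mailings', key,
                               ColumnParent(column_family='Users'),
                               predicate, ConsistencyLevel.ONE)
    return len(columns)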

[elubow@db5 db]$ time python cas_get_slice.py
{0: 4170, 1: 1935783, 2: 1560042}
Total: 3557584

real    213m57.854s
user    14m57.768s
sys     0m51.634s

Run 2: 132m27.930s
Run 3: 156m19.906s
Run 4: 127m34.715s

Ultimately, Cassandra came with quite a learning curve, but in my opinion it is well worth it. Cassandra is an extremely powerful database system that I plan to keep exploring in greater detail with a few more in-depth tests. If you have the chance, take a look at Cassandra.