Archive for 2009

Git Command Aliases

Monday, December 28th, 2009

This is kind of a tip of the day, but I just think it's cool, so I am sharing it with everyone. As a recent convert to Git who still has to use Subversion at my place of work, I find myself constantly doing things like this out of habit:

$ git st && git ci

Well, now I can do that (although it may not be a good idea) with git aliases:

elubow@beacon (master) supportskydivers$ git config --global alias.st status
elubow@beacon (master) supportskydivers$ git config --global alias.ci commit
elubow@beacon (master) supportskydivers$ git st && git ci
# On branch master
nothing to commit (working directory clean)

Now st and ci are git aliases for status and commit respectively.
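Under the hood, those two commands just write an [alias] section to your ~/.gitconfig, so you can also manage aliases by editing that file directly:

[alias]
    st = status
    ci = commit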

What Does Web 2.0 Mean To You?

Wednesday, December 23rd, 2009

I have been doing a lot of reading and a lot of thinking, trying to decide what exactly Web 2.0 means. What advancement in an emerging technology called the internet is so massive that it warrants incrementing the major version number?

Some people say it's the looks: the new feel of the internet, with crazy CSS, rounded corners, and a lighter, more airy feeling. I don't think that's it.

Some people say that it's the AJAX layer that has been added to the internet. This refers to the layer of interactivity a web page can give you. I don't think it's this either.
(more…)

Python Multiprocessing Pools and MySQL

Monday, December 21st, 2009

There really isn't a solid Python module for multiprocessing and MySQL. This may be because MySQL on a single server is disk bound and therefore limited in speed, or just because no one has written it. So here is a quick and dirty example using the Pool class from the multiprocessing module in Python 2.6 and MySQLdb.

I also tried using PySQLPool. It was designed for threading, not forking as I am doing with the Pool approach. Although I am sure it is possible to use PySQLPool with forking by passing the connection (pool) object down to the child process, or possibly doing something with IPC, I decided to keep it simple (although slightly more expensive) and instantiate MySQLdb connections upon fork.
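As a rough sketch of what I mean (the connection parameters, table, and query are all placeholders), each worker can open its own MySQLdb connection in the pool's initializer, which runs once per forked child:

import multiprocessing
import MySQLdb

def init_worker():
    # Runs once inside each forked worker: give every child its own
    # connection so no MySQL socket is ever shared across the fork
    global conn
    conn = MySQLdb.connect(host='localhost', user='user',
                           passwd='secret', db='test')

def process_row(row_id):
    cursor = conn.cursor()
    cursor.execute("SELECT data FROM rows WHERE id = %s", (row_id,))
    result = cursor.fetchone()
    cursor.close()
    return result

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4, initializer=init_worker)
    print pool.map(process_row, range(100))

Since the connection is created after the fork, the children never fight over a shared connection object.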
(more…)

Python's MySQLdb 2014 Error – Commands out of sync

Friday, December 18th, 2009

While writing a simple Python script to access and process data in a database, I came across an error that said:

Error 2014: Commands out of sync; you can't run this command now
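For context before diving in: MySQL throws error 2014 when a new command is sent on a connection whose previous result set hasn't been fully consumed. A minimal sketch of the usual trigger (connection parameters and table names are made up):

import MySQLdb
from MySQLdb.cursors import SSCursor

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='test')

# A server-side cursor streams rows instead of buffering them client-side
stream = conn.cursor(SSCursor)
stream.execute("SELECT id FROM users")
stream.fetchone()

# With rows still pending on the wire, any other command on the same
# connection fails with 2014
other = conn.cursor()
other.execute("SELECT COUNT(*) FROM users")

The fix is to fetch or discard all pending rows (or close the cursor) before issuing the next command.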

After quite a bit of Googling with very little to show for it, I had to dive in a little and try to figure out what was going on. The whole error looked like this:
(more…)

Sitemaps On Rails

Thursday, December 17th, 2009

SEO being an interest of mine, I couldn't quite wrap my head around releasing a webapp without a sitemap. The problem is that there aren't any really great sitemap plugins for Rails. Now, I will grant that creating a sitemap in Rails is a challenging proposition and one that I would not like to undertake on my own unless absolutely necessary. But I was hoping that there would be a Rails sitemap "killer app" like there is for almost everything else in Rails.

So I dove in and tried a few options until I found one that worked. First I wrote some code to generate an XML file and then created a sitemap_index.xml.gz file by hand. This was very kludgy and definitely not a permanent solution. I had also read suggestions about doing it in a sitemap_controller.rb file, but that seemed just as kludgy as using a view to generate the XML. It was then time to explore the plugin world.
(more…)

Custom Google Maps Marker With YM4R_GM

Monday, December 14th, 2009

In one of my Rails applications, I allow the user to search for surrounding businesses from their current location, and I always show them a You Are Here marker. The issue I had was that this marker was always the same icon as the search results. Differentiating these markers is actually extremely easy with the ym4r_gm plugin.

First thing is to find a custom icon that you want to use. You can just Google for custom Google Maps icons. I chose to use their default icon, just in a different blue. (You can download it here so you are working with the same icon I am using for this example.) The next thing I did was to use the Google custom markers web site to find the proper config options for the icon.
(more…)

Git Branch Name in Your Bash Prompt

Friday, December 11th, 2009

I work with a few repositories at any given time. And during that time, I typically have multiple branches created for each repository. I figured that it would make my life easier if I knew which branch and/or repository I was working in. Luckily, very little hackery is required here since the git distribution already comes with such a tool. (Note: If you didn’t build Git from source, then you may not have this file.)
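To spoil the ending a little, the result looks something like this in your ~/.bashrc (the path to the completion script depends on where your Git source tree lives; mine is just an example):

source ~/src/git/contrib/completion/git-completion.bash
export PS1='\u@\h \W$(__git_ps1 " (%s)")\$ '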
(more…)

AT&T – Reactive vs. Proactive

Thursday, December 10th, 2009

As much as I hate to steal a title or a good joke, I want to title this post iPhone Outage? There’s An App For That. Why? Because it’s funny.

So why am I talking about reactive vs. proactive? In case you haven't seen it yet, AT&T recently came out with an app called AT&T Mark The Spot. The idea behind the app is that if you have a dropped call or bad reception, you open the app, click your problem, and it marks the spot by sending the information to AT&T. I am still not entirely sure how this app works in an area where there is NO reception: how does it know where you are in order to tell AT&T?
(more…)

Transferring Email From Gmail/Google Apps to Dovecot With Larch

Wednesday, December 9th, 2009

As regular readers of this blog know, I am in the process of trying to back up Google Apps accounts to Dovecot. Well, I have finally found my solution. Not only does it work, but it's in Ruby.

First thing that you'll need to do is grab yourself a copy of Larch. I did this simply by typing sudo gem install larch and it installed everything nicely, but click the link to the repository on Github if that doesn't work for you.
(more…)

Country-State Select Using Carmen and jQuery

Monday, December 7th, 2009

I've been wanting to find a way to use drop down menus for countries and their states when they exist. But keeping a list on my own would have been a huge pain in the ass. So rather than reinvent the wheel, I found the Carmen plugin for Rails. All I have to do is keep the plugin updated and my list of countries and their states will be kept updated as well.

How do I do all this with unobtrusive Javascript and Rails, you ask? Good question. Let me show you. Don't forget to install the plugin (or use the gem).

Let's start out by adding the drop down menu to our view. In my case I have it in a partial for the address part of a form. You'll have to modify this slightly to pick up the values of the form if the partial handles edits as well. This one is just for the new method, as it uses a default country of US and its states. Note the div ID here of addressStates; we will be using this later in the Javascript.
(more…)

Backing Up Gmail/Google Apps to a Dovecot Server

Thursday, December 3rd, 2009

I have been trying to find a way to copy everything from a Gmail account to a Dovecot mail server. The way I have ended up doing it so far is simply by using Apple Mail (if you regularly read this blog, you’d know that I use a Mac). The steps are as follows:

  1. Create 2 accounts in Apple Mail: Gmail and the Dovecot account
  2. Sync the Gmail account to your local computer
  3. Copy everything to the Dovecot server

This works, but I have to use a slow connection (my home connection) and I have a lot of accounts to do this for, so I would much prefer to script it. The problem is that I have been trying to get this to work with either imapsync or imapcopy, and neither seems to work properly.
(more…)

Creating a Slave DNS Server on Bind9

Sunday, November 29th, 2009

I couldn’t find a quick and dirty list of commands for setting up a slave DNS server so I figured I would just throw it together.

Starting with a fully working primary name server, we are going to set up a slave name server. We are going to make the following assumptions:
  • primary name server – 1.2.3.4
  • slave name server – 4.5.6.7
  • the domain example.com should have a slave name server

On the primary (or master) name server, add the following lines to the options section.

options {
    allow-transfer { 4.5.6.7; };
    notify yes;
};

Ensure that you also increment the serial number in the SOA record on the master so the slave knows the zone has changed.
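For reference, the serial is the first numeric field of the SOA record; a typical zone-file entry looks something like this (values are illustrative):

example.com. IN SOA ns1.example.com. hostmaster.example.com. (
        2009112901 ; serial - increment on every change
        3600       ; refresh
        900        ; retry
        604800     ; expire
        86400 )    ; minimum TTL

Then run: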

# rndc reload

On the slave name server, add the following entry to the named.conf file (or whichever file houses your zone entries). Ensure that the path leading up to the zone file exists and that bind has write access to that directory.

zone "example.com" {
    type slave;
    file "/etc/bind9/zones/example.com.slave";
    masters { 1.2.3.4; };
};

Once you've made the changes on the slave, you will need to reload the configuration. Do this the same way you did on the master:

# rndc reload

If you watch your DNS log, you should see the transfer happen as soon as both name servers have reloaded.

SSH Over The Web With Web Shell

Friday, November 27th, 2009

After reading a Tweet from Matt Cutts about being able to SSH from the iPhone (and the web in general), I had to give it a try. I am always looking for better ways to be able to check on systems when necessary. I have iPhone apps for SSHing around if I need to as well, but like with any "new" tool, I have to try it out to see if it serves a purpose or makes my admin life easier in any way.

First, go check out the Google Code repository for Web Shell. Web Shell is written in Python and is based on Ajaxterm. All that's required is SSL and Python 2.3 or greater. It works in any browser that has Javascript and can make use of AJAX.

The way Web Shell works is that you start it up on a server and can then use a web browser to SSH to that machine (and only that machine). This works best if you have a gateway server to a network and use a single point of entry to access the rest of the servers. Web Shell runs over HTTPS on port 8022. Reading the README will lead you through the same set of instructions I used below. Once installed, we connect using a web browser: https://server.com:8022/
(more…)

Adding AJAX Bookmarks to Your Rails Application (Part 2 of 2)

Wednesday, November 25th, 2009

In part 1 of this series, we discussed the base models, controller, and database migrations necessary to get this project off the ground. Now we are going to continue building out this functionality.

Let's take a look at what needs to go into the models to support this. If you have a model that uses a slug generated via to_param, your code will look like the top model. If you are using the normal numeric id convention, it will look like the bottom model. The reason for the specifically named methods get_title and get_description will become apparent when you start displaying bookmarks. The thought process is that you can use a consistent set of calls for displaying the bookmark information and put the code to grab that information in the model where it belongs, rather than loading up the helper methods. It should also be noted that the title and description fields are not always consistent across models; this naming convention therefore returns the proper column through consistently named methods.
(more…)

Adding AJAX Bookmarks to Your Rails Application (Part 1 of 2)

Monday, November 23rd, 2009

If you want to add the ability to bookmark pages in your Rails application, it's actually fairly straightforward. You can even do it in AJAX. There may be better ways to do this, but this way is somewhat abstract and it works for me, so hopefully it can work for you too. It is abstract in the sense that it will work for models with different URL styles and different column names.

The way this works is that you add a bookmark icon (which is initially disabled) to a show <model_name> page. When the user clicks on the bookmark icon, an AJAX query is made in the background to update the user's bookmark list. I am approaching this from an abstract methodology, meaning that I have "forced" these methods to work with models implemented in various fashions (as I give examples of below). The AJAX call is simply going to work as a toggle: it calls a toggle method in the bookmarks controller, changes the current value, and replaces the image. The user can then view the pages they have bookmarked in their profile.

I have decided to break this into a multi-part blog entry because it ends up being quite long. Not necessarily in how long it takes, just the amount of space it takes to show all the code. I have done my best to only show relevant code and maintain brevity. Note: I will not cover how to allow for unobtrusive AJAX calls. That is beyond the scope of this set of posts.
(more…)

Modsecurity 2.5 Review Coming

Sunday, November 22nd, 2009

The folks over at Packt Publishing were kind enough to send me an advance copy of the upcoming Modsecurity book by Magnus Mischel. I have written about mod_security before, but really haven't had a chance to look into it recently. I am anxious to see where it's advanced to in version 2.5.

If you don’t know anything about mod_security, I encourage you to read up on it in the interim.

Stay tuned for the review.

File Read Write Create with IO::File

Friday, November 20th, 2009

Ran into an annoying gotcha with Perl's IO::File. Opening a file in append mode with read access puts the file position pointer at the end of the file if the file already exists; if it doesn't exist, it creates the file. Note the +>> below, which opens the file read/write/append. You can also use the more common (and more easily recognizable) form of a+.

    # Open read/write/append; the file is created if it doesn't exist
    my $FH = new IO::File "$file", "+>>";
    while (my $line = $FH->getline()) {
      print "Line: $line\n";
    }
    undef $FH;

I noticed that when I tried to read a file that already existed, nothing would be read. I neglected to realize that you must seek to position 0 in the file if you want to read it. The following code will work:

    my $FH = new IO::File "$file", "+>>";
    # Rewind to the beginning: append mode leaves the position
    # pointer at the end of an existing file
    $FH->seek(0,0);
    while (my $line = $FH->getline()) {
      print "Line: $line\n";
    }
    undef $FH;

Although it might seem obvious that you need to be at the beginning of the file to read it forward (and it is), I didn't realize that opening a file in append mode leaves the file pointer at the last position in the file (in hindsight, that appears a bit more obvious too).

Thoughts on Blog Posting

Thursday, November 19th, 2009

During a conversation I was having with Nirvdrum about blog posts, we got to discussing the validity and credibility of blog posting along with how and why people do it. I have a few thoughts on this topic.

The first and foremost reason that I write blog posts is that engineers who spend a lot of time figuring things out on the fly could use a helping hand. A lot of that figuring is done by piecing together parts of other people's solutions from various blogs and papers. Every time I run into an issue or fix a problem, I try to write a blog post about it. I don't do this because I feel that I have more to offer than anyone else; I just feel like my work should be able to benefit others (there is no use in reinventing the wheel). And to top it off, if I do something and someone has a better way, I like hearing about it in the comments or via email.
(more…)

Converting From Subversion To Git

Monday, November 16th, 2009

Now that I have basically fallen for Git, I decided to finally move my Subversion repository over to Git (this way I can finally have a remote backup of it that I am comfortable with on Codaset).

The method for this was a lot more straightforward than I expected it to be. For the conversion tool, I used Nirvdrum's fork of svn2git. It's a feature-complete version of the svn2git portion, though the rest of it is still in development. Since it is a Ruby gem, getting it installed was a breeze. Just make sure that you have Ruby and rubygems installed.
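Assuming the gem is published somewhere your rubygems can find it (GitHub-hosted gems sometimes required adding an extra gem source at the time), installation should be the usual one-liner:

$ sudo gem install svn2git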
(more…)

Remote Code Storage

Monday, November 9th, 2009

I was chatting with a friend of mine the other day about version control and why it's necessary. So I decided to throw together a few options and a little explanation about why it's important.

I have been using version control in some form or another for many years. I started with CVS, then moved to Subversion (which I still use quite a bit), and now, as my latest post about Git GUIs on the Mac suggests, I have moved to Git. The one thing that has been consistent across every single transition is that I have always had some sort of remote code storage. During the CVS days, I used a CVS pserver and stored my code both locally and remotely for safety (and ease of checkout/deployment). For Subversion, I always stored my code locally and used an apache install somewhere with a WebDAV module to get at and deploy whatever code was necessary.

Ultimately, I use remote code storage for 2 reasons: to back up my existing code base (so I have it in more than one place) and to have a visualization of what is going on in the project. That visualization is handy as a central, consistent view for multiple people (unlike a personal client, which can differ per user).
(more…)

Git GUI on Mac OS X

Friday, November 6th, 2009

I have been using Git a lot lately and have found a lot of things I like better in Git than in Subversion. The one major item that was really bothering me was that there weren't many Git clients that could help you visualize the repository: merges, commits, branching, blame, etc. Seeing that CVS and Subversion have been around for a lot longer, there are many clients for them; now that I have been using Git on the command line for a while, I decided to take another look.

What I am looking for is simple. I want 2 things:

  1. In the typical Mac style, I want a great looking interface. I want to be able to see who did what, when, and why (assuming good commit messages from the developers).
  2. Easy navigation through all the features. I am not planning on using any of the commands visually; I am still an archaic command line junkie.

One of my favorite features of Git, coming from Subversion, is the ease of branching. I branch for everything now that I am using Git. So in order to best track my changes, I was hoping for something to help me visualize my branches. I didn't count this specifically among my requirements because it wasn't necessary for a client to be acceptable, but it definitely would have helped tip the scales.
(more…)

One Time Modal Windows With Rails and Fancybox

Tuesday, November 3rd, 2009

Let's say you have a situation where you want a modal window to show up only once for each user. It's actually not that difficult, although lots of Googling around got me nowhere. I am choosing to use FancyBox for my modal window, but feel free to use your modal framework of choice. So let's get down to business.

First thing you'll need to do is download FancyBox and copy the stylesheets, images, and Javascript files to their proper/desired location in your Rails app. Style the window according to your liking.

Whether it is right or wrong, I did this entirely in the view, without even pulling the Javascript out into the application.js (or even another Javascript file for that matter). My reason was that I only want the modal window showing up on this page. If you want your modal window to show up somewhere else (or on every page), then put the code in your layout. But remember that this call will be executed every time the page loads. I put mine in a profile page which doesn’t get accessed that often so the conditional is not checked quite as frequently.

My application uses Facebook Connect and grabs the user's Facebook Proxy email address (FB app developers will know what this is). So I check if that's the email I have for the user. If yes, then I pop up a modal window on page load, only once, to get their regular email address and possibly a password so they can log in to their account without Facebook Connect if they want. When the modal window is shown, a variable is set in the cookie (note that this cookie is shared with authlogic for sessions) to ensure that the modal window isn't shown again.
(more…)

Price of Commercials

Wednesday, October 28th, 2009

The price of commercials is especially high for engineers. And by commercials, I don’t mean an intermission between pieces of a sitcom or drama, I mean the brief 15 seconds of an interruption when someone asks an engineer in the zone a question that takes 3 seconds to answer. For the sake of argument, let’s say an engineer gets interrupted a mere 5 times per day including lunch and a daily meeting (let’s call it a scrum for fun).

If it takes that engineer, admin, developer, or whatever 10 minutes to get focused after each interruption (on top of the initial settling in when first getting to the office), that means that out of an 8 hour day, 1 hour is wasted just refocusing. Refocusing just puts you back on the issue; it doesn't put you back in the zone. Some engineers only get in the zone once per day. At that rate, you can massively waste someone's productivity with a 10 second interruption.

What's my point? Good question. That commercial/question/interruption that someone is pushing onto that engineer could be the straw that breaks the camel's back on a deadline. So be aware of the situation that your people are in: who is talking to them, who has access to them, and who takes advantage of that access. Those precious periods of concentration can afford you a huge win or bring about a big loss.

Printing New Lines in Bash

Thursday, October 22nd, 2009

Ran across this the other day and decided it required sharing. If you want to print a new line '\n' in an echo statement in bash, one would think it's just as simple as:

beacon:~ elubow$ echo "This is a test\n"
This is a test\n

The problem is that this doesn’t interpolate the newline character. (For more information on interpolation, see Wikipedia here.) In order to have the newline interpolated, you need to add the command line switch ‘-e‘.

beacon:~ elubow$ echo -e "This is a test\n"
This is a test

This will force Bash to interpolate any non-literal characters in the quotes. Note: Unlike Perl, single or double quotes don’t matter here when Bash is deciding whether or not to interpolate the new line characters.
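As a side note, printf interprets backslash escapes in its format string by default, so it is another option here:

beacon:~ elubow$ printf "This is a test\n"
This is a test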

Apple Steps Up With Multitouch Mouse

Tuesday, October 20th, 2009

A while ago I wrote about how Apple needed an external multi-touch solution and how you could use your iPhone until one arrived. Apple has now done it and released the Magic Mouse.

To quote Apple, "It's the world's first multi-touch mouse." It's a wireless mouse that connects via Bluetooth to any computer that has it. It's sleek, just like everything else Apple makes. But the best part is that (as of now) it's only $69. Good work Apple.

I could go on and on about why I think it's cool and what it can do, but why waste time on my website reading a summary? Just check it out on Apple's web site: http://www.apple.com/magicmouse/

First Experience With Cassandra

Monday, October 19th, 2009

I recently posted about my initial experience with Tokyo Cabinet. Now it’s time to get to work on Cassandra.

Cassandra is the database that's in production use at Facebook, handling their email system, and also at Digg.

One thing that I would like to note is that when I tested TC, I used the Perl API for both TC and TT; for Cassandra, I tried both the Perl and Ruby APIs. I couldn't get the Ruby API (written by Evan Weaver of Twitter) to work at all with the Cassandra server (although I am sure the server included with the gem works well). I initially struggled quite a bit with the UUID aspects of the Perl API until I finally gave up and changed the ColumnFamily CompareWith type from

<ColumnFamily CompareWith="TimeUUIDType" Name="Users" />

to

<ColumnFamily CompareWith="BytesType" Name="Users" />

Then everything was working well and I began to write my tests. The layout I ended up using works in a schemaless fashion. I created 2 consistent columns per user: email and person_id. Here is where it gets interesting and different for those of us used to RDBMSes and having fewer columns. For this particular project, every time a user is sent an email, a "row" (I call it a row for those unfamiliar with Cassandra terminology; it is actually a column) is added in the format send_dates_<date> (note the structure below). The value of this column is the id of the mailing campaign sent to the user on that date. This means that if a user receives 365 emails per year at one a day, there will be 365 rows (or Cassandra columns) that start with send_dates_ and end with YYYY-MM-DD. Note the data structure below, in a JSON-ish format.

Users = {
    'foo@example.com': {
        email: 'foo@example.com',
        person_id: '123456',
        send_dates_2009-09-30: '2245',
        send_dates_2009-10-01: '2247',
    },
    'bar@baz.com': {
        email: 'bar@baz.com',
        person_id: '789',
        send_dates_2009-09-30: '2245',
        send_dates_2009-10-01: '2246',
    },
}

To understand all the data structures in Cassandra better, I strongly recommend reading WTF Is A SuperColumn Cassandra Data Model and Up And Running With Cassandra. They are written by folks at Digg and Twitter respectively and are well worth the reads.

So for my first iteration, I simply loaded up the data in the format mentioned above. Every insert also does an insert of the email and person_id, just in case they aren't there to begin with. The initial data set has approximately 3.6 million records. This caused all sorts of problems with the default configuration (i.e., Cassandra kept crashing on me). The changes I made to the default configuration are as follows:

  • Change the maximum file descriptors from 1024 (the system default) to 65535 (or unlimited)
  • Change the default minimum and maximum Java heap sizes to -Xms256M and -Xmx2G (I could not get the data to load past 2.5 million records without upping the maximum memory)
[elubow@db5 db]$ time ./cas_load.pl -D 2009-09-30 -c queue-mail.ini -b lists/
usa: 99,272
top: 3,661,491
Total: 3,760,763

real    72m50.826s
user    29m57.954s
sys     2m18.816s
[elubow@db5 cassandra]# du -sh data/Mailings/ # Prior to data compaction
13G     data/Mailings/
[elubow@db5 cassandra]# du -sh data/Mailings/ # Post data compaction
1.4G    data/Mailings/

It was interesting to note that the write latency over the roughly 3.6 million records was 0.004 ms. Also, the data compaction brought the size of the records on disk down from 13G to 1.4G. Those figures were achieved with the reads and writes happening on the same machine.

The load of the second data set took a mere 30 minutes, compared to closer to 180 minutes for loading the same data set into Tokyo Cabinet.

luxe: 936,911
amex: 599,981
mex: 39,700
Total: 1,576,592

real    30m53.109s
user    12m53.507s
sys     0m59.363s
[elubow@db5 cassandra]# du -sh data/Mailings/
2.4G    data/Mailings/

Now that there is a dataset worth working with, it’s time to start the read tests.

For the first test, I am doing a simple get of the email column. This is just to iterate over the column and find out the approximate speed of the read operations.

Run 1: 134m59.923s
Run 2: 125m55.673s
Run 3: 127m21.342s
Run 4: 119m2.414s

For the second test, I made use of a Cassandra feature called get_slice. Since I have columns in the format send_dates_YYYY-MM-DD, I used get_slice to grab all column names on a per-row (each email address) basis that were between send_dates_2009-09-29 and send_dates_2009-10-29. The maximum number that could be matched was 2 (since I only loaded 2 days' worth of data into the database). I used data from a 3rd day so I could get 0, 1, or 2 as results.

This first run is using the Perl version of the script.

Email Count: 3557584
Match 0: 4,247
Match 1: 1,993,273
Match 2: 1,560,064

real    177m23.000s
user    45m21.859s
sys     9m17.516s

Run 2: 151m27.042s

On subsequent runs, I began to run into API issues and rewrote the same script in Python to see if the better-tested Thrift Python API was faster than the Thrift Perl API (and wouldn't give me timeout issues). The Perl timeout issues ended up being fixable, but I continued with the tests in Python.
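For reference, the heart of such a script is a single get_slice call per email address. A rough sketch through the Thrift-generated Python bindings of that era might look like this (host, port, and consistency level are assumptions; the Mailings keyspace and Users column family come from my setup above; treat the details as approximate):

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import ColumnParent, SlicePredicate, SliceRange, ConsistencyLevel

# Open a plain Thrift connection to the Cassandra node
socket = TSocket.TSocket('localhost', 9160)
transport = TTransport.TBufferedTransport(socket)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

# Grab all columns named between the two send_dates_ bounds for one row
predicate = SlicePredicate(slice_range=SliceRange(
    start='send_dates_2009-09-29', finish='send_dates_2009-10-29',
    reversed=False, count=100))
columns = client.get_slice('Mailings', 'foo@example.com',
                           ColumnParent(column_family='Users'),
                           predicate, ConsistencyLevel.ONE)
print len(columns)  # 0, 1, or 2 matches for this email address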

[elubow@db5 db]$ time python cas_get_slice.py
{0: 4170, 1: 1935783, 2: 1560042}
Total: 3557584

real    213m57.854s
user    14m57.768s
sys     0m51.634s

Run 2: 132m27.930s
Run 3: 156m19.906s
Run 4: 127m34.715s

Ultimately, there was quite a bit of a learning curve with Cassandra, but in my opinion it is well worth it. Cassandra is an extremely powerful database system that I plan on continuing to explore in greater detail with a few more in-depth tests. If you have the chance, take a look at Cassandra.

Migrations Without belongs_to Or references

Wednesday, October 14th, 2009

Normally, when you do a database migration in Rails and add ownership from one model to another, you use the concept of belongs_to or references:

  create_table :comments do |t|
    t.belongs_to :user
    t.references :post
  end

Interestingly enough, these methods are only available during the initial table creation. If you want to add a reference to a model that is created later, you have to do it the old-fashioned way, by just adding a column:

   add_column :comments, :group_id, :integer

Doing it this way is clean, easy, and definitely meets the KISS principle. But I do find it interesting that one can’t add an association later in the game. Sometimes the Rails way is just KISS and adding the column by hand.

Parsing Ini Files With Ruby

Sunday, October 11th, 2009

There doesn't seem to be a lot of documentation or many examples of parsing ini files in Ruby. There are definitely shortcut ways to do it, and I could always write my own methods, but why reinvent the wheel when there are gems? So start out by simply installing the inifile gem.

beacon:~ elubow$ sudo gem install inifile
Successfully installed inifile-0.1.0
1 gem installed
Installing ri documentation for inifile-0.1.0...
Installing RDoc documentation for inifile-0.1.0...

The code for the gem is available from Github here, and other inifile documentation is available here. The documentation is a good reference but doesn't contain any examples.

For some reason (which I don’t understand so please feel free to explain it in the comments if you know), you have to do more than just the standard require statement for this gem. At the top of your Ruby code, add the lines below. Make sure that you replace the directory location with your directory location.

require 'rubygems'
$:.unshift( '/usr/lib64/ruby/gems/1.8/gems/inifile-0.1.0/lib/' )
require 'inifile'

A short example of the ini file that we will be working with:

[foo]
bar = "baz"
dir = "2009-10-05/"
id = 75

To get the id parameter of the ini file, assuming you know it's in the [foo] section, you can use the code below. Notice the parameter section of the new object instantiator. The reason for this is that ini files are pretty free-form and can have a few variations in format, so you can specify the comment style and parameter definition style during object instantiation. My ini files use '=' to assign parameters.

  ini = IniFile.new( options[:conf], :parameter => '=' )
  section = ini['foo']  # section names are looked up as strings
  id = section['id']

Using the above code, the id variable now contains the contents of the id parameter from the ini file.

Tokyo Tyrant and Tokyo Cabinet

Friday, October 9th, 2009

Tokyo Tyrant and Tokyo Cabinet are the components for a database used by Mixi (basically a Japanese Facebook). And for work, I got to play with these tools for some research. Installing all this stuff along with the Perl APIs is incredibly easy.

Ultimately I am working on a comparison of Cassandra and Tokyo Cabinet, but I will get to more on Cassandra later.

The tests I am going to be doing are fairly simple. I am going to load a few million rows into a TCT database (a table database, in TC terms) and then load key/value pairs into the database. The layout, in a hash format, is basically going to be as follows:

{
      "user@example.com" => {   "sendDates" => {"2009-09-30"},   },
      "123456789" => {  "2009-09-30" => {"2287"}   },
}

I ran these tests INSERTing the data into a table database and, as serialized data, into a hash database. It is necessary to point out that the load on this machine is its normal load, so this cannot be a true benchmark. Since the conditions are not optimal (but really, when are they ever), take the results with a grain of salt. Also, there is some data munging going on during every iteration to grab the email addresses and other data. All this is being done through the Perl API and Tokyo Tyrant. The machine this is running on has two dual-core 2.5GHz Intel Xeon processors and 16G of memory.

For the first round, a few things should be noted:

  • The totals referenced below are counts of email addresses added/modified in the db
  • I am only using 1 connection to the Tokyo Tyrant DB and it is currently setup to handle 8 threads
  • I didn’t do any memory adjustment on startup, so the default (which is marginal) is in use
  • I am only using the standard put operations, not putcat, putkeep, or putnr (which I will be using later)

The results of the table database are as follows. It is also worth noting the size of the table is around 410M on disk.

[elubow@db5 db]$ time ./tct_test.pl -b lists/ -D 2009-09-30 -c queue-mail.ini
usa: 99,272
top: 3,661,491
Total: 3,760,763

real    291m53.204s
user    4m53.557s
sys     2m35.604s
[root@db5 tmp]# ls -l
-rw-r--r-- 1 root root 410798800 Oct  6 23:15 mailings.tct

The structure for the hash database (seeing as it's only key/value) is as follows:

      "user@example.com" => "2009-09-30",
      "123456789" => "2009-09-30|2287",

The results of loading the same data into a hash database are as follows. It is also worth noting the size of the table is around 360M on disk. This is significantly smaller than the 410M of the table database containing the same style data.

[elubow@db5 db]$ time ./tch_test.pl -b lists/ -D 2009-09-30 -c queue-mail.ini
usa: 99,272
top: 3,661,491
Total: 3,760,763

real    345m29.444s
user    2m23.338s
sys     2m15.768s
[root@db5 tmp]# ls -l
-rw-r--r-- 1 root root 359468816 Oct  7 17:50 mailings.tch

For the second round, I loaded a second day's worth of data into the database. I used the same layouts as above, with the following noteworthy items:

  • I did a get first prior to the put to decide whether to use put or putcat
  • The new data structure is now either “2009-09-30,2009-10-01” or “2009-09-30|1995,2009-10-01|1996”

Results of the hash database test round 2:

[elubow@db5 db]$ time ./tch_test.pl -b lists/ -D 2009-10-01 -c queue-mail.ini
luxe: 936,911
amex: 599,981
mex: 39,700
Total: 1,576,592

real    177m55.280s
user    1m53.289s
sys     2m8.606s
[elubow@db5 db]$ ls -l
-rw-r--r-- 1 root root 461176784 Oct  7 23:44 mailings.tch

Results of the table database test round 2:

[elubow@db5 db]$ time ./tct_test.pl -b lists/ -D 2009-10-01 -c queue-mail.ini
luxe: 936,911
amex: 599,981
mex: 39,700
Total: 1,576,592

real    412m19.007s
user    4m39.064s
sys     2m22.343s
[elubow@db5 db]$ ls -l
-rw-r--r-- 1 root root 512258816 Oct  8 12:41 mailings.tct

When it comes down to the final implementation, I will likely be parallelizing the put in some form. I would like to think that a database designed for this sort of thing works best in a concurrent environment (especially considering the default startup value is 8 threads).

It is obvious that when it comes to load times, the hash database is much faster. Now it's time to run some queries and see how this stuff goes down.

So I first ran some queries against the table database. I grabbed a new list of 3.6 million email addresses, iterated over the list, grabbed each record from the table database, and counted how many dates (via array value counts) were entered for that email address. I ran the script 4 times and the results were as follows. (I typically throw out the first run since caching kicks in for the other runs.)

Run 1: 10m35.689s
Run 2: 5m41.896s
Run 3: 5m44.505s
Run 4: 5m44.329s

Doing the same thing for the hash database, I got the following result set:

Run 1: 7m54.292s
Run 2: 4m13.467s
Run 3: 3m59.302s
Run 4: 4m13.277s

I think the results speak for themselves. A hash database is obviously faster (which is something most of us assumed from the beginning). The rest of the time comes from programmatic comparisons, like date comparisons on specific slices of the array. Load times can be sped up using concurrency, but given the requirements of the project, the gets have to be done in this sequential fashion.

Now it's on to testing Cassandra in a similar fashion for comparison.

Causing More Problems Than You Solve

Wednesday, October 7th, 2009

To start off, if you know me personally, then you know I recently (July 30, 2009) broke my leg skydiving. If you're interested, you can see the video on Youtube here. To make a long story short, I had surgery that night, they put a titanium rod in my thigh, and I have been on crutches since. I have only recently started learning to walk again (though I haven't fully gotten there yet). This week my insurance decided that it was no longer necessary to send me to physical therapy (thanks, Oxford).

Like any corporation, Oxford is in the business of making money, and in this case they are doing so by deciding not to pay for my PT. In the long run, the lack of rehabilitation will likely leave me in a weakened state and generally more prone to injury once I go back to my skydiving, motorcycle riding, MMA, and BASE jumping ways. If Oxford had said "let's make sure he can walk and then we'll cut him off," at least I'd have a foundation and be less prone to injury, and they might save a bit of money on me in the long run.

So what does this sob story have to do with IT? A decision made now in order to save money can end up costing you more time and money in the long run. And since time is money, sometimes a little bit of planning can go a long way. Should you add the feature now because your biggest client wants it by Friday? If you do, you might lose a few smaller clients along the way, and the word of mouth may be more damaging than temporarily upsetting that large client.

Perhaps you set up Nagios and immediately turned on alerting without learning the thresholds that your machines typically sit at. Then you get a whole set of alerts and spend more time sorting the real problems from the ones that just have a slightly abnormal operating level than you would have spent learning your machines' thresholds to begin with.

There are a million examples that could be listed here. The point is, before jumping into a decision, try to make sure that you’re not going to be paying for it in the long run. A little planning can go a long way.