Migrations Without belongs_to Or references

Normally, when you do a database migration in Rails and want to add ownership from one model to another, you use the concept of belongs_to or references:

  create_table :comments do |t|
    t.belongs_to :user
    t.references :post
  end

Interestingly enough, these methods seem to be available only during the initial table creation. If you want to add a reference to a model that is created later, you have to do it the old-fashioned way, by just adding a column:

  add_column :comments, :group_id, :integer

Doing it this way is clean, easy, and definitely meets the KISS principle. But I do find it interesting that one can’t add an association later in the game. Sometimes the Rails way is simply to keep it simple and add the column by hand.

Posted in Rails.

Parsing Ini Files With Ruby

There doesn’t seem to be a lot of documentation or examples about parsing ini files in Ruby. There are definitely shortcut ways to do it and I could always write my own methods, but why reinvent the wheel when there are gems? So start out by simply installing the inifile gem.

beacon:~ elubow$ sudo gem install inifile
Successfully installed inifile-0.1.0
1 gem installed
Installing ri documentation for inifile-0.1.0...
Installing RDoc documentation for inifile-0.1.0...

The code for the gem is available from GitHub here. Other inifile documentation is available here. That documentation is a good reference but doesn’t contain any examples.

For some reason (which I don’t understand so please feel free to explain it in the comments if you know), you have to do more than just the standard require statement for this gem. At the top of your Ruby code, add the lines below. Make sure that you replace the directory location with your directory location.

require 'rubygems'
$:.unshift( '/usr/lib64/ruby/gems/1.8/gems/inifile-0.1.0/lib/' )
require 'inifile'

A short example of the ini file that we will be working with:

[foo]
bar = "baz"
dir = "2009-10-05/"
id = 75

To get the id parameter of the ini file, assuming you know it’s in the [foo] section, you can use the code below. Notice the parameter option passed to the constructor. The reason for this is that ini files are loosely specified and can have a few variations on format. Therefore you can specify the comment style and parameter definition style during object instantiation. My ini files use ‘=’ to assign parameters.

  # options[:conf] is assumed to hold the path to the ini file
  ini = IniFile.new( options[:conf], :parameter => '=' )
  # Section names are strings, so the key must be quoted
  section = ini['foo']
  id = section['id']

Using the above code, the id variable now contains the contents of the id parameter from the ini file.
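To make the comment style and parameter style options more concrete, here is a minimal, stdlib-only sketch of what such a parser does with the sample file above. It is not the inifile gem's implementation; the parse_ini method and its arguments are hypothetical names I made up for illustration, and a real gem handles many more cases.

```ruby
# Minimal INI reader for illustration only (hypothetical helper, not the gem).
# parameter: the string that separates a key from its value.
# comment:   the string that starts a comment line.
def parse_ini(text, parameter = '=', comment = ';')
  sections = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?(comment)
    if line =~ /\A\[(.+)\]\z/
      # A "[name]" line opens a new section.
      current = sections[$1] = {}
    elsif current && line.include?(parameter)
      key, value = line.split(parameter, 2).map { |part| part.strip }
      # Drop surrounding double quotes, if any; values stay strings.
      current[key] = value.sub(/\A"(.*)"\z/, '\1')
    end
  end
  sections
end

sample = <<-INI
[foo]
bar = "baz"
dir = "2009-10-05/"
id = 75
INI

config = parse_ini(sample)
puts config['foo']['id']
```

Passing a different second argument (say ':') would handle files that use another assignment character, which is the kind of variation the gem's options are there for.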

Posted in Ruby.

Tokyo Tyrant and Tokyo Cabinet

Tokyo Tyrant and Tokyo Cabinet are the components for a database used by Mixi (basically a Japanese Facebook). And for work, I got to play with these tools for some research. Installing all this stuff along with the Perl APIs is incredibly easy.

Ultimately I am working on a comparison of Cassandra and Tokyo Cabinet, but I will get to more on Cassandra later.

The tests I am going to be doing are fairly simple. I am going to load a few million rows into a TCT database (which is a table database in TC terms) and then load key/value pairs into the database. The layout in a hash format is basically going to be as follows:

{
      "user@example.com" => {   "sendDates" => {"2009-09-30"},   },
      "123456789" => {  "2009-09-30" => {"2287"}   },
}

I ran the tests in two formats: INSERTing the data into a table database, and storing it as serialized data in a hash database. It is necessary to point out that the load on this machine is the normal load, so this cannot be a true benchmark. Since the conditions are not optimal (but really, when are they ever), take the results with a grain of salt. Also, there is some data munging going on during every iteration to grab the email addresses and other data. All this is being done through the Perl API and Tokyo Tyrant. The machine that this is running on has two dual-core 2.5GHz Intel Xeon processors and 16G of memory.

For the first round, a few things should be noted:

  • The totals referenced below are counts of email addresses added/modified in the db
  • I am only using 1 connection to the Tokyo Tyrant DB and it is currently set up to handle 8 threads
  • I didn’t do any memory adjustment on startup, so the default (which is marginal) is in use
  • I am only using the standard put operations, not putcat, putkeep, or putnr (which I will be using later)

The results of the table database are as follows. It is also worth noting the size of the table is around 410M on disk.

[elubow@db5 db]$ time ./tct_test.pl -b lists/ -D 2009-09-30 -c queue-mail.ini
usa: 99,272
top: 3,661,491
Total: 3,760,763

real    291m53.204s
user    4m53.557s
sys     2m35.604s
[root@db5 tmp]# ls -l
-rw-r--r-- 1 root root 410798800 Oct  6 23:15 mailings.tct

The structure for the hash database (seeing as it’s only key/value) is as follows:

      "user@example.com" => "2009-09-30",
      "123456789" => "2009-09-30|2287",

The results of loading the same data into a hash database are as follows. It is also worth noting that the size of the database file is around 360M on disk. That is roughly 12% smaller than the 410M of the table database containing the same style of data.

[elubow@db5 db]$ time ./tch_test.pl -b lists/ -D 2009-09-30 -c queue-mail.ini
usa: 99,272
top: 3,661,491
Total: 3,760,763

real    345m29.444s
user    2m23.338s
sys     2m15.768s
[root@db5 tmp]# ls -l
-rw-r--r-- 1 root root 359468816 Oct  7 17:50 mailings.tch
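To compare the two first-round loads on equal terms, the wall-clock ("real") times can be turned into addresses per second. The sketch below only does arithmetic on the numbers printed above; the helper names real_to_seconds and rows_per_second are my own and are not part of the test scripts.

```ruby
# Convert the "real" value printed by `time` (e.g. "291m53.204s") to seconds.
def real_to_seconds(real)
  match = real.match(/\A(\d+)m([\d.]+)s\z/)
  raise ArgumentError, "unexpected time format: #{real}" unless match
  match[1].to_i * 60 + match[2].to_f
end

# Addresses added/modified per second of wall-clock time.
def rows_per_second(total_rows, real)
  total_rows / real_to_seconds(real)
end

total = 3_760_763 # "Total" line from both first-round runs
printf("table (tct): %.1f rows/sec\n", rows_per_second(total, '291m53.204s'))
printf("hash  (tch): %.1f rows/sec\n", rows_per_second(total, '345m29.444s'))
```

By this measure the table load ran at roughly 215 addresses per second and the hash load at roughly 181, so in the first round the table database was the quicker of the two.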

For the second round, I loaded a second day’s worth of data into the database. I used the same layouts as above, with the following noteworthy items:

  • I did a get first prior to the put to decide whether to use put or putcat
  • The new data structure is now either “2009-09-30,2009-10-01” or “2009-09-30|1995,2009-10-01|1996”
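The get-then-put-or-putcat decision can be sketched roughly as follows. This uses an in-memory stand-in rather than the Tokyo Tyrant API, so FakeStore and record_send are hypothetical names for illustration only; the real scripts go through the Perl API.

```ruby
# In-memory stand-in for the hash database; NOT the Tokyo Tyrant API.
class FakeStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  # Overwrite (or create) the record.
  def put(key, value)
    @data[key] = value
  end

  # Append to the existing value, creating the record if it is missing.
  def putcat(key, value)
    @data[key] = (@data[key] || '') + value
  end
end

# Do a get first: put for a new key, putcat (with a comma separator)
# when the key already holds data from an earlier day.
def record_send(store, key, entry)
  if store.get(key).nil?
    store.put(key, entry)
  else
    store.putcat(key, ",#{entry}")
  end
end

store = FakeStore.new
record_send(store, 'user@example.com', '2009-09-30')
record_send(store, 'user@example.com', '2009-10-01')
record_send(store, '123456789', '2009-09-30|1995')
record_send(store, '123456789', '2009-10-01|1996')
puts store.get('user@example.com')
puts store.get('123456789')
```

The extra get costs a round trip per record, but it avoids a leading comma on brand new keys.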

Results of the hash database test round 2:

[elubow@db5 db]$ time ./tch_test.pl -b lists/ -D 2009-10-01 -c queue-mail.ini
luxe: 936,911
amex: 599,981
mex: 39,700
Total: 1,576,592

real    177m55.280s
user    1m53.289s
sys     2m8.606s
[elubow@db5 db]$ ls -l
-rw-r--r-- 1 root root 461176784 Oct  7 23:44 mailings.tch

Results of the table database test round 2:

[elubow@db5 db]$ time ./tct_test.pl -b lists/ -D 2009-10-01 -c queue-mail.ini
luxe: 936,911
amex: 599,981
mex: 39,700
Total: 1,576,592

real    412m19.007s
user    4m39.064s
sys     2m22.343s
[elubow@db5 db]$ ls -l
-rw-r--r-- 1 root root 512258816 Oct  8 12:41 mailings.tct

When it comes down to the final implementation, I will likely be parallelizing the put in some form. I would like to think that a database designed for this sort of thing works best in a concurrent environment (especially considering the default startup value is 8 threads).

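When it comes to load times, the picture is mixed: in the first round the table database actually finished sooner (about 292 minutes versus 345), but in the second round the hash database was much faster (about 178 minutes versus 412). Now it’s time to run some queries and see how this stuff goes down.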

So I ran some queries first against the table database. I grabbed a new list of 3.6 million email addresses and iterated over the list, grabbed the record from the table database and counted how many dates (via array value counts) were entered for that email address. I ran the script 4 times and results were as follows. I typically throw out the first run since caching kicks in for the other runs.

Run 1: 10m35.689s
Run 2: 5m41.896s
Run 3: 5m44.505s
Run 4: 5m44.329s

Doing the same thing for the hash database, I got the following result set:

Run 1: 7m54.292s
Run 2: 4m13.467s
Run 3: 3m59.302s
Run 4: 4m13.277s
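For the hash database, the per-address lookup and date count described above might look something like this. A plain Ruby Hash stands in for the database here, and count_dates and dates_per_address are hypothetical helper names; the actual test scripts use the Perl API.

```ruby
# Count the dates stored in one hash-database value.
# Values look like "2009-09-30,2009-10-01" or "2009-09-30|1995,2009-10-01|1996".
def count_dates(value)
  return 0 if value.nil? || value.empty?
  value.split(',').size
end

# Iterate over a list of addresses sequentially and count dates for each.
# `store` is anything that answers [] with the stored string (a Hash here).
def dates_per_address(store, addresses)
  counts = {}
  addresses.each { |address| counts[address] = count_dates(store[address]) }
  counts
end

store = {
  'user@example.com' => '2009-09-30,2009-10-01',
  '123456789'        => '2009-09-30|1995'
}
counts = dates_per_address(store, ['user@example.com', '123456789', 'missing@example.com'])
counts.each { |address, count| puts "#{address}: #{count}" }
```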

I think the results speak for themselves. A hash database is obviously faster for these lookups (which is something most of us assumed from the beginning). The rest of the time comes from programmatic comparisons like date comparisons in specific slices of the array. Load times can be sped up using concurrency, but given the requirements of the project, the gets have to be done in this sequential fashion.

Now it’s on to testing Cassandra in a similar fashion for comparison.

Causing More Problems Than You Solve

To start off, if you know me personally, then you know I recently (July 30, 2009) broke my leg skydiving. If you’re interested, you can see this video on YouTube here. To make a long story short, I had surgery that night, they put a titanium rod in my thigh, and I have been on crutches since. I have only recently started learning to walk again (I have not yet reached that point). This week my insurance decided that it was no longer necessary to send me to Physical Therapy (thanks Oxford).

Like any corporation, Oxford is in the business of making money and in this case, they are doing so by deciding not to pay for my PT. In the long run, the lack of rehabilitation will likely leave me in a weakened state and generally more prone to injury once I go back to my skydiving, motorcycle riding, MMA, and BASE jumping ways. If Oxford had said, let’s make sure he can walk and then we’ll cut him off, at least he’ll have a foundation and be less prone to injury; then they might be saving a bit of money on me in the long run.

So what does this sob story have to do with IT? A decision made now in order to save money can end up costing you more time and money in the long run. And since time is money, sometimes a little bit of planning can go a long way. Should you add the feature now because your biggest client wants it by Friday? Well, if you do that, then you might lose a few smaller clients along the way, and the word of mouth may be more damaging than temporarily upsetting that large client.

Perhaps you set up Nagios and immediately turned on alerting without learning the thresholds that your machines typically sit at. Then you get a whole set of alerts, and you spend more time trying to sort the real problems from the ones that just have a slightly abnormal operating level than you would have if you had just looked at your machines’ thresholds to begin with.

There are a million examples that could be listed here. The point is, before jumping into a decision, try to make sure that you’re not going to be paying for it in the long run. A little planning can go a long way.

Posted in Misc.