<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Erics Tech Blog &#187; Tokyo Cabinet</title>
	<atom:link href="http://eric.lubow.org/tag/tokyo-cabinet/feed/" rel="self" type="application/rss+xml" />
	<link>http://eric.lubow.org</link>
	<description>Thoughts, musings, and other idealistic (sometimes useful) systems and development hoopla.</description>
	<lastBuildDate>Fri, 18 Nov 2011 14:56:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
		<item>
		<title>Tokyo Tyrant and Tokyo Cabinet</title>
		<link>http://eric.lubow.org/2009/databases/tokyo-tyrant-and-tokyo-cabinet/</link>
		<comments>http://eric.lubow.org/2009/databases/tokyo-tyrant-and-tokyo-cabinet/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 14:00:38 +0000</pubDate>
		<dc:creator>eric</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Tokyo Cabinet]]></category>

		<guid isPermaLink="false">http://eric.lubow.org/?p=278</guid>
		<description><![CDATA[Tokyo Tyrant and Tokyo Cabinet are the components for a database used by Mixi (basically a Japanese Facebook). And for work, I got to play with these tools for some research. Installing all this stuff along with the Perl APIs is incredibly easy. Ultimately I am working on a comparison of Cassandra and Tokyo Cabinet, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://1978th.net/tokyotyrant/">Tokyo Tyrant</a> and <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> are the components for a database used by Mixi (basically a Japanese Facebook).  And for work, I got to play with these tools for some research.  Installing all this stuff along with the Perl APIs is incredibly easy.</p>
<p>Ultimately I am working on a comparison of <a href="http://incubator.apache.org/cassandra/">Cassandra</a> and <a href="http://1978th.net/">Tokyo Cabinet</a>, but I will get to more on <a href="http://incubator.apache.org/cassandra/">Cassandra</a> later.</p>
<p>The tests I am going to be doing are fairly simple. I am going to load a few million rows into a TCT database (which is a table database in TC terms) and then load the same data as key/value pairs into a hash database.  The layout in a hash format is basically going to be as follows:</p>
<div class="codecolorer-container perl default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff0000;">&quot;user@example.com&quot;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span> &nbsp; <span style="color: #ff0000;">&quot;sendDates&quot;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span><span style="color: #ff0000;">&quot;2009-09-30&quot;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> &nbsp; <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff0000;">&quot;123456789&quot;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span> &nbsp;<span style="color: #ff0000;">&quot;2009-09-30&quot;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #009900;">&#123;</span><span style="color: #ff0000;">&quot;2287&quot;</span><span style="color: #009900;">&#125;</span> &nbsp; <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span><br />
<span style="color: #009900;">&#125;</span></div></div>
<p>I ran these tests by INSERTing the data in two formats: into a table database, and as serialized data in a hash database.  It is necessary to point out that the load on this machine is the normal load.  Therefore it cannot be a true benchmark.  Since the conditions are not optimal (but really, when are they ever), take the results with a grain of salt.  Also, there is some data munging going on during every iteration to grab the email addresses and other data.  All this is being done through the Perl API and Tokyo Tyrant.  The machine that this is running on has dual dual-core 2.5GHz Intel Xeon processors and 16G of memory.</p>
<p>For the first round, a few things should be noted:</p>
<ul>
<li>The totals referenced below are counts of email addresses added/modified in the db</li>
<li>I am only using 1 connection to the Tokyo Tyrant DB and it is currently set up to handle 8 threads</li>
<li>I didn&#8217;t do any memory adjustment on startup, so the default (which is marginal) is in use</li>
<li>I am only using the standard put operations, not <em>putcat</em>, <em>putkeep</em>, or <em>putnr</em> (which I will be using later)</li>
</ul>
<p>The results of the table database are as follows.  It is also worth noting the size of the table is around 410M on disk.</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #7a0874; font-weight: bold;">&#91;</span>elubow<span style="color: #000000; font-weight: bold;">@</span>db5 db<span style="color: #7a0874; font-weight: bold;">&#93;</span>$ <span style="color: #000000; font-weight: bold;">time</span> .<span style="color: #000000; font-weight: bold;">/</span>tct_test.pl <span style="color: #660033;">-b</span> lists<span style="color: #000000; font-weight: bold;">/</span> <span style="color: #660033;">-D</span> <span style="color: #000000;">2009</span>-09-<span style="color: #000000;">30</span> <span style="color: #660033;">-c</span> queue-mail.ini <br />
usa: <span style="color: #000000;">99</span>,<span style="color: #000000;">272</span><br />
top: <span style="color: #000000;">3</span>,<span style="color: #000000;">661</span>,<span style="color: #000000;">491</span><br />
Total: <span style="color: #000000;">3</span>,<span style="color: #000000;">760</span>,<span style="color: #000000;">763</span><br />
<br />
real &nbsp; &nbsp;291m53.204s<br />
user &nbsp; &nbsp;4m53.557s<br />
sys &nbsp; &nbsp; 2m35.604s<br />
<span style="color: #7a0874; font-weight: bold;">&#91;</span>root<span style="color: #000000; font-weight: bold;">@</span>db5 tmp<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #666666; font-style: italic;"># ls -l</span><br />
<span style="color: #660033;">-rw-r--r--</span> <span style="color: #000000;">1</span> root root <span style="color: #000000;">410798800</span> Oct &nbsp;<span style="color: #000000;">6</span> <span style="color: #000000;">23</span>:<span style="color: #000000;">15</span> mailings.tct</div></div>
<p>The structure for the hash database (seeing as it is only key/value) is as follows:</p>
<div class="codecolorer-container perl default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="perl codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">&nbsp; &nbsp; &nbsp; <span style="color: #ff0000;">&quot;user@example.com&quot;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">&quot;2009-09-30&quot;</span><span style="color: #339933;">,</span><br />
&nbsp; &nbsp; &nbsp; <span style="color: #ff0000;">&quot;123456789&quot;</span> <span style="color: #339933;">=&gt;</span> <span style="color: #ff0000;">&quot;2009-09-30|2287&quot;</span><span style="color: #339933;">,</span></div></div>
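<p>As a rough illustration of the serialized format, here is a minimal sketch of building and parsing such a value in plain Perl.  This is not the test script itself; the function names <em>encode_entries</em> and <em>decode_entries</em> are hypothetical, and the comma separator between entries is an assumption based on the round 2 format shown later.</p>
<pre>
use strict;
use warnings;

# Build the flat string stored under one key in the hash database.
# Each entry is either a bare date or "date|id"; entries are comma-joined
# (the comma separator is an assumption for this sketch).
sub encode_entries {
    my @entries = @_;
    my @parts;
    for my $e (@entries) {
        push @parts, defined($e->{id}) ? "$e->{date}|$e->{id}" : $e->{date};
    }
    return join ',', @parts;
}

# Parse the flat string back into a list of { date, id } hash refs.
# The id is undef when the entry is a bare date.
sub decode_entries {
    my ($value) = @_;
    my @entries;
    for my $chunk (split /,/, $value) {
        my ($date, $id) = split /\|/, $chunk, 2;
        push @entries, { date => $date, id => $id };
    }
    return @entries;
}

my $value = encode_entries(
    { date => '2009-09-30', id => 1995 },
    { date => '2009-10-01', id => 1996 },
);
print "$value\n";    # 2009-09-30|1995,2009-10-01|1996

my @decoded = decode_entries($value);
print scalar(@decoded), " entries\n";    # 2 entries
</pre>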
<p>The results of loading the same data into a hash database are as follows. It is also worth noting the size of the table is around 360M on disk.  This is significantly smaller than the 410M of the table database containing the same style data.</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #7a0874; font-weight: bold;">&#91;</span>elubow<span style="color: #000000; font-weight: bold;">@</span>db5 db<span style="color: #7a0874; font-weight: bold;">&#93;</span>$ <span style="color: #000000; font-weight: bold;">time</span> .<span style="color: #000000; font-weight: bold;">/</span>tch_test.pl <span style="color: #660033;">-b</span> lists<span style="color: #000000; font-weight: bold;">/</span> <span style="color: #660033;">-D</span> <span style="color: #000000;">2009</span>-09-<span style="color: #000000;">30</span> <span style="color: #660033;">-c</span> queue-mail.ini <br />
usa: <span style="color: #000000;">99</span>,<span style="color: #000000;">272</span><br />
top: <span style="color: #000000;">3</span>,<span style="color: #000000;">661</span>,<span style="color: #000000;">491</span><br />
Total: <span style="color: #000000;">3</span>,<span style="color: #000000;">760</span>,<span style="color: #000000;">763</span><br />
<br />
real &nbsp; &nbsp;345m29.444s<br />
user &nbsp; &nbsp;2m23.338s<br />
sys &nbsp; &nbsp; 2m15.768s<br />
<span style="color: #7a0874; font-weight: bold;">&#91;</span>root<span style="color: #000000; font-weight: bold;">@</span>db5 tmp<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #666666; font-style: italic;"># ls -l</span><br />
<span style="color: #660033;">-rw-r--r--</span> <span style="color: #000000;">1</span> root root <span style="color: #000000;">359468816</span> Oct &nbsp;<span style="color: #000000;">7</span> <span style="color: #000000;">17</span>:<span style="color: #000000;">50</span> mailings.tch</div></div>
<p>For the second round, I loaded a second day&#8217;s worth of data into the database.  I used the same layouts as above with the following noteworthy items:</p>
<ul>
<li>I did a <em>get</em> first prior to the <em>put</em> to decide whether to use <em>put</em> or <em>putcat</em></li>
<li>The new data structure is now either &#8220;2009-09-30,2009-10-01&#8221; or &#8220;2009-09-30|1995,2009-10-01|1996&#8221;</li>
</ul>
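<p>The get-then-decide step can be sketched roughly as follows.  To keep the example runnable without a server, a plain Perl hash stands in for the database; in a real run the lookup, the first write, and the append would be the <em>get</em>, <em>put</em>, and <em>putcat</em> calls mentioned above.  The function name <em>record_send</em> is hypothetical.</p>
<pre>
use strict;
use warnings;

# In-memory stand-in for the remote database (an assumption for this
# sketch; the real tests went through Tokyo Tyrant).
my %store;

# Record one new entry for a key, choosing between "put" and "putcat".
sub record_send {
    my ($key, $entry) = @_;
    if (exists $store{$key}) {
        # Key already present: append with a comma, as putcat would.
        $store{$key} .= ",$entry";
    }
    else {
        # First entry for this key: a plain put.
        $store{$key} = $entry;
    }
    return $store{$key};
}

record_send('user@example.com', '2009-09-30');
record_send('user@example.com', '2009-10-01');

my $value = $store{'user@example.com'};
print "$value\n";    # 2009-09-30,2009-10-01
</pre>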
<p>Results of the hash database test round 2:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #7a0874; font-weight: bold;">&#91;</span>elubow<span style="color: #000000; font-weight: bold;">@</span>db5 db<span style="color: #7a0874; font-weight: bold;">&#93;</span>$ <span style="color: #000000; font-weight: bold;">time</span> .<span style="color: #000000; font-weight: bold;">/</span>tch_test.pl <span style="color: #660033;">-b</span> lists<span style="color: #000000; font-weight: bold;">/</span> <span style="color: #660033;">-D</span> <span style="color: #000000;">2009</span>-<span style="color: #000000;">10</span>-01 <span style="color: #660033;">-c</span> queue-mail.ini <br />
luxe: <span style="color: #000000;">936</span>,<span style="color: #000000;">911</span><br />
amex: <span style="color: #000000;">599</span>,<span style="color: #000000;">981</span><br />
mex: <span style="color: #000000;">39</span>,<span style="color: #000000;">700</span><br />
Total: <span style="color: #000000;">1</span>,<span style="color: #000000;">576</span>,<span style="color: #000000;">592</span><br />
<br />
real &nbsp; &nbsp;177m55.280s<br />
user &nbsp; &nbsp;1m53.289s<br />
sys &nbsp; &nbsp; 2m8.606s<br />
<span style="color: #7a0874; font-weight: bold;">&#91;</span>elubow<span style="color: #000000; font-weight: bold;">@</span>db5 db<span style="color: #7a0874; font-weight: bold;">&#93;</span>$ <span style="color: #c20cb9; font-weight: bold;">ls</span> <span style="color: #660033;">-l</span><br />
<span style="color: #660033;">-rw-r--r--</span> <span style="color: #000000;">1</span> root root <span style="color: #000000;">461176784</span> Oct &nbsp;<span style="color: #000000;">7</span> <span style="color: #000000;">23</span>:<span style="color: #000000;">44</span> mailings.tch</div></div>
<p>Results of the table database test round 2:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #7a0874; font-weight: bold;">&#91;</span>elubow<span style="color: #000000; font-weight: bold;">@</span>db5 db<span style="color: #7a0874; font-weight: bold;">&#93;</span>$ <span style="color: #000000; font-weight: bold;">time</span> .<span style="color: #000000; font-weight: bold;">/</span>tct_test.pl <span style="color: #660033;">-b</span> lists<span style="color: #000000; font-weight: bold;">/</span> <span style="color: #660033;">-D</span> <span style="color: #000000;">2009</span>-<span style="color: #000000;">10</span>-01 <span style="color: #660033;">-c</span> queue-mail.ini<br />
luxe: <span style="color: #000000;">936</span>,<span style="color: #000000;">911</span><br />
amex: <span style="color: #000000;">599</span>,<span style="color: #000000;">981</span><br />
mex: <span style="color: #000000;">39</span>,<span style="color: #000000;">700</span><br />
Total: <span style="color: #000000;">1</span>,<span style="color: #000000;">576</span>,<span style="color: #000000;">592</span><br />
<br />
real &nbsp; &nbsp;412m19.007s<br />
user &nbsp; &nbsp;4m39.064s<br />
sys &nbsp; &nbsp; 2m22.343s<br />
<span style="color: #7a0874; font-weight: bold;">&#91;</span>elubow<span style="color: #000000; font-weight: bold;">@</span>db5 db<span style="color: #7a0874; font-weight: bold;">&#93;</span>$ <span style="color: #c20cb9; font-weight: bold;">ls</span> <span style="color: #660033;">-l</span><br />
<span style="color: #660033;">-rw-r--r--</span> <span style="color: #000000;">1</span> root root <span style="color: #000000;">512258816</span> Oct &nbsp;<span style="color: #000000;">8</span> <span style="color: #000000;">12</span>:<span style="color: #000000;">41</span> mailings.tct</div></div>
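<p>To put the four load runs above on a common scale, the elapsed times can be converted into addresses per second.  The sketch below only does that arithmetic on the totals and &#8220;real&#8221; times already shown; the helper name <em>real_to_seconds</em> is hypothetical.</p>
<pre>
use strict;
use warnings;

# Convert a "real" value from time(1), such as "291m53.204s", to seconds.
sub real_to_seconds {
    my ($real) = @_;
    my ($min, $sec) = $real =~ /^(\d+)m([\d.]+)s$/
        or die "unexpected time format: $real";
    return $min * 60 + $sec;
}

# Label, addresses added/modified, and elapsed time for each run above.
my @runs = (
    [ 'round 1, table', 3_760_763, '291m53.204s' ],
    [ 'round 1, hash',  3_760_763, '345m29.444s' ],
    [ 'round 2, hash',  1_576_592, '177m55.280s' ],
    [ 'round 2, table', 1_576_592, '412m19.007s' ],
);

for my $run (@runs) {
    my ($label, $count, $real) = @$run;
    my $rate = $count / real_to_seconds($real);
    printf "%-15s %6.1f addresses/sec\n", $label, $rate;
}
</pre>
<p>This works out to roughly 215 and 181 addresses per second for the table and hash databases in round 1, and roughly 148 and 64 for the hash and table databases in round 2.</p>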
<p>When it comes down to the final implementation, I will likely be parallelizing the <em>put</em> in some form.  I would like to think that a database designed for this sort of thing works best in a concurrent environment (especially considering the default startup value is 8 threads).</p>
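<p>When it comes to load times, the picture is mixed: the hash database took longer on the first load (about 345 minutes versus 292 for the table database), but it was much faster on the second load (about 178 minutes versus 412).  Now it&#8217;s time to run some queries and see how this stuff goes down.</p>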
<p>So I ran some queries first against the table database.  I grabbed a new list of 3.6 million email addresses and iterated over the list, grabbed the record from the table database and counted how many dates (via array value counts) were entered for that email address.  I ran the script 4 times and results were as follows.  I typically throw out the first run since caching kicks in for the other runs.</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Run <span style="color: #000000;">1</span>: 10m35.689s<br />
Run <span style="color: #000000;">2</span>: 5m41.896s<br />
Run <span style="color: #000000;">3</span>: 5m44.505s<br />
Run <span style="color: #000000;">4</span>: 5m44.329s</div></div>
<p>Doing the same thing for the hash database, I got the following result set:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">Run <span style="color: #000000;">1</span>: 7m54.292s<br />
Run <span style="color: #000000;">2</span>: 4m13.467s<br />
Run <span style="color: #000000;">3</span>: 3m59.302s<br />
Run <span style="color: #000000;">4</span>: 4m13.277s</div></div>
<p>I think the results speak for themselves.  A hash database is obviously faster for these lookups, at roughly 4 minutes per cached run against roughly 5 minutes 40 seconds for the table database (which is something most of us assumed from the beginning).  The rest of the time comes from programmatic comparisons like date comparisons in specific slices of the array.  Load times can be sped up using concurrency, but given the requirements of the project, the <em>get</em>s have to be done in this sequential fashion.</p>
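<p>For reference, the query loop described above looks roughly like the sketch below for the hash database format.  A plain Perl hash stands in for the database so the example runs on its own, the sample addresses are made up, and <em>count_dates</em> is a hypothetical name; in the real tests each value came back from a <em>get</em> through Tokyo Tyrant.</p>
<pre>
use strict;
use warnings;

# Hypothetical stand-in for the hash database contents.
my %store = (
    'a@example.com' => '2009-09-30,2009-10-01',
    'b@example.com' => '2009-09-30',
);

# Count the comma-separated entries in one stored value.
# A missing or empty value counts as zero dates.
sub count_dates {
    my ($value) = @_;
    return 0 unless defined $value and length $value;
    my @entries = split /,/, $value;
    return scalar @entries;
}

# Walk the address list sequentially, one lookup per address.
my %dates_for;
for my $email ('a@example.com', 'b@example.com', 'c@example.com') {
    $dates_for{$email} = count_dates($store{$email});
}

for my $email (sort keys %dates_for) {
    print "$email: $dates_for{$email}\n";
}
</pre>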
<p>Now it&#8217;s on to testing <a href="http://incubator.apache.org/cassandra/">Cassandra</a> in a similar fashion for comparison.</p>


]]></content:encoded>
			<wfw:commentRss>http://eric.lubow.org/2009/databases/tokyo-tyrant-and-tokyo-cabinet/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

