<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>FewBar.com - Make it good &#187; memcachedb</title>
	<atom:link href="http://fewbar.com/tag/memcachedb/feed/" rel="self" type="application/rss+xml" />
	<link>http://fewbar.com</link>
	<description>Technology, life, and mischief, not in that order</description>
	<lastBuildDate>Fri, 23 Dec 2011 01:41:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.5</generator>
		<item>
		<title>TokyoOops</title>
		<link>http://fewbar.com/2009/10/tokyo-tyrant-ignores-memcache-protocol-flags/</link>
		<comments>http://fewbar.com/2009/10/tokyo-tyrant-ignores-memcache-protocol-flags/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 04:28:52 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Memcache]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[berkeleydb]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[process]]></category>
		<category><![CDATA[RTFM]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[tokyotyrant]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=117</guid>
		<description><![CDATA[We had a fun time this week with TokyoTyrant. Recently it has become apparent that MemcacheDB has been all but abandoned. As fantastic as the early work was by Steve Chu, the project is in disrepair. That, coupled with the less than obvious failover for its replication combined to make us seek alternatives. Brian Aker [...]]]></description>
			<content:encoded><![CDATA[<p>We had a fun time this week with <a href="http://1978th.net/tokyotyrant/">TokyoTyrant</a>. Recently it has become apparent that <a href="http://www.memcachedb.org/">MemcacheDB</a> has been all but abandoned. As fantastic as the early work was by Steve Chu, the project is in disrepair. That, coupled with the <a href="http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/">less than obvious failover for its replication</a> combined to make us seek alternatives.</p>
<p><a href="http://fewbar.com/wp-content/uploads/2009/10/virtual_stupidity.jpg"><img class="alignnone size-full wp-image-121" title="virtual_stupidity" src="http://fewbar.com.s3.amazonaws.com/wp-content/uploads/2009/10/virtual_stupidity.jpg" alt="virtual_stupidity" width="280" height="280" /></a></p>
<p><span id="more-117"></span><br />
<a href="http://krow.net">Brian Aker</a> had mentioned to me at one time that TokyoTyrant was way better than MemcacheDB and that we should run it instead. I took notice, and it turns out he&#8217;s right! It does basically the same thing, applying the memcache protocol to an on-disk key/value store. However, the code is incredibly clean and well maintained, and it runs extremely fast. There&#8217;s also a lot more flexibility, with the ability to choose between in-memory and on-disk storage, hash tables, B+ trees, etc.</p>
<p>The availability of log based asynchronous master/master replication (somewhat similar to MySQL&#8217;s replication in concept) was probably one of the biggest wins, allowing much simpler failover (just move the IP, or DNS, or whatever) when compared to MemcacheDB&#8217;s adherence to BerkeleyDB&#8217;s replication setup, which is a single-master system implementing an election algorithm.</p>
<p>Somewhere during migration, we missed one tiny detail though. Sometimes, the devil is in the details. This is really the only evidence in <a href="http://1978th.net/tokyotyrant/spex.html#protocol">the documentation that tokyo tyrant has support for the memcache protocol</a>. It is very clear:</p>
<blockquote><p>Memcached Compatible Protocol</p>
<p>As for the memcached (ASCII) compatible protocol, the server implements the following commands; &#8220;set&#8221;, &#8220;add&#8221;, &#8220;replace&#8221;, &#8220;get&#8221;, &#8220;delete&#8221;, &#8220;incr&#8221;, &#8220;decr&#8221;, &#8220;stats&#8221;, &#8220;flush_all&#8221;, &#8220;version&#8221;, and &#8220;quit&#8221;. &#8220;noreply&#8221; options of update commands are also supported. However, &#8220;flags&#8221;, &#8220;exptime&#8221;, and &#8220;cas unique&#8221; parameters are ignored.</p></blockquote>
<p>Now, as I said, there&#8217;s nothing ambiguous about this. That would have helped, if anyone on my team had ever read it. We installed TokyoTyrant, pointed our basic test code at it, and it worked. This is really a process problem, not so much a technical one. The process must be to assume it won&#8217;t work, and test all the different use cases to make sure it works.</p>
<p>Now, why is that bit of the manual important? Well, we use PHP. Specifically, we use the PECL &#8220;Memcache&#8221; module to access memcache protocol storage. Now, the Memcache module is mostly oriented toward caching in the original memory-based memcached. It works great for MemcacheDB too, which simply ignores the exptime parameter. However, MemcacheDB *does not* ignore &#8220;flags&#8221;.</p>
<p>And therein lies the problem. Users of the <a href="http://pecl.php.net/package/memcache">PECL Memcache module</a> may not know this, but the flags are *important*. There are two bits in that flags field that the Memcache module may set. Bit 0 is used to indicate whether or not the content has been serialized, and, therefore, on read, must be unserialized. Bit 1 is used to indicate whether or not the content has been gzipped.</p>
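<p>To make those two bits concrete, here is a rough sketch of what a flag-aware client has to do on write and on read. This is not the PECL module&#8217;s actual code: the constant and function names are made up for illustration, and gzcompress()/gzuncompress() are assumed as a stand-in for whatever compression the module really applies.</p>

```php
<?php
// Hypothetical names for the two flag bits described above.
const FLAG_SERIALIZED = 1; // bit 0: value was passed through serialize()
const FLAG_COMPRESSED = 2; // bit 1: value was compressed

// Encode a PHP value the way a flag-aware client might before a "set".
// Returns [payload string, flags integer].
function encode_value($value, bool $compress = false): array
{
    $flags = 0;
    if (!is_string($value)) {
        $value = serialize($value);
        $flags |= FLAG_SERIALIZED;
    }
    if ($compress) {
        $value = gzcompress($value); // needs the zlib extension
        $flags |= FLAG_COMPRESSED;
    }
    return [$value, $flags];
}

// Decode a payload using the flags the server returned with it.
function decode_value(string $payload, int $flags)
{
    if ($flags & FLAG_COMPRESSED) {
        $payload = gzuncompress($payload);
    }
    if ($flags & FLAG_SERIALIZED) {
        return unserialize($payload);
    }
    return $payload;
}

[$payload, $flags] = encode_value(['user' => 42]);
// With the flags intact the array comes back. With the flags forced to 0,
// which is what a server that ignores them hands back, the caller gets
// the serialized string instead.
$withFlags = decode_value($payload, $flags);
$withoutFlags = decode_value($payload, 0);
```

<p>The last two lines are the whole problem in miniature: the bytes in storage are the same, but without the flags the client has no way to know it should unserialize them.</p>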
<p>So, while all of the strings that were stored in MemcacheDB and subsequently copied to TokyoTyrant worked great, the serialized objects, arrays, and gzipped values were completely inoperative, as they were coming back to the code as serialized strings and binary compressed data. The gzipped data was easy (turn off automatic gzip compression). The serialized data took some quick tap dancing to remedy, with code something like this:</p>
<p><code lang="php"><br />
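class Memcache_BrokenFlags extends Memcache<br />
{<br />
// TokyoTyrant drops the flags, so try to unserialize every string value.<br />
public function get($key, &amp;$flags = null)<br />
{<br />
$v = parent::get($key, $flags);<br />
if (!is_string($v)) {<br />
return $v; // miss (false) or multi-key array: nothing to decode<br />
}<br />
$uv = @unserialize($v);<br />
// 'b:0;' is a serialized false, not a failed unserialize()<br />
return ($uv === false &amp;&amp; $v !== 'b:0;') ? $v : $uv;<br />
}<br />
}<br />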
</code></p>
<p>Luckily our code all uses one Factory method to spawn all &#8220;MemcacheDB&#8221; connections, so it was easy to substitute this in.</p>
<p>Eventually we can just change the code by segregating it into things that always serialize and things that don&#8217;t, and just do the serialization ourselves. This should eventually allow us to use the new <a href="http://pecl.php.net/package/tokyo_tyrant">tokyo_tyrant module in PECL</a>, which only reliably stores scalars (I noticed recent versions have added a call to the internal PHP function convert_to_string(). This is, I think, a mistake, but one that still leaves it up to the programmer to explicitly serialize when serialization is desired).</p>
<p>This was a pretty big gotcha, and one that illustrates a point: even though us cowboy coders and sysadmins sometimes get annoyed when those pesky business people ask us for plans, schedules, expected impact, etc., and we keep assuring them we know what&#8217;s up, it&#8217;s still important to actually know what&#8217;s up, and to make sure to RTFMC .. C as in, CAREFULLY.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2009/10/tokyo-tyrant-ignores-memcache-protocol-flags/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>TokyoTyrant &#8211; MemcacheDB, but without the BDB?</title>
		<link>http://fewbar.com/2009/06/tokyotyrant-memcachedb-but-without-the-bdb/</link>
		<comments>http://fewbar.com/2009/06/tokyotyrant-memcachedb-but-without-the-bdb/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 06:40:26 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Memcache]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[drizzle]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[tokyocabinet]]></category>
		<category><![CDATA[tokyotyrant]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=85</guid>
		<description><![CDATA[Anyway, the next thing I mentioned was that we had also tried MemcacheDB with some success. Brian wasn't exactly impressed with MemcacheDB, and immediately suggested that we should be using <a href="http://tokyocabinet.sourceforge.net/tyrantdoc/">Tokyo Tyrant</a> instead. I had heard of Tokyo Cabinet, the new hotness in local key/value storage and retrieval, but what is this Tyrant you speak of?]]></description>
			<content:encoded><![CDATA[<p>This past April I was riding in a late model, 2 door rental car with an interesting trio for sure. On my right sat <a href="http://capttofu.livejournal.com/">Patrick Galbraith</a>, maintainer of DBD::mysql and author of the Federated storage engine. Directly in front of me, manning the steering wheel (for those of you keen on spatial description, you may have noted at this point that it&#8217;s most likely I was seated in the back, left seat of a car which is designed to be driven on the right side of the road. EOUF [end of useless fact]), was David Axmark, co-founder of MySQL. Immediately to his right sat <a href="http://krow.net/">Brian Aker</a>, of (most recently) Drizzle fame.<br />
<span id="more-85"></span><br />
This was one of those conversations that I felt grossly unprepared for. It was the 2009 MySQL User&#8217;s conference, and Patrick and I had been hacking on <a href="https://launchpad.net/dbd-drizzle">DBD::drizzle</a> for most of the day. We had it 98% of the way there and were in need of food, so we were joining the Drizzle dev team for gourmet pizza.</p>
<p>As we navigated from the Santa Clara conference center to Mountain View&#8217;s quaint downtown, Brian, Patrick, and I were discussing memcached stuff. I mentioned <a href="http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/">my idea, and subsequent implementation of the Mogile+Memcached method for storing data more reliably</a> in memcached. I knew in my head why we had chosen to read from all of the replica servers, not just the first one that worked, but I forgot (The reason, btw, is that if one of the servers had missed a write for some reason, you might get out-of-date data). I guess I was a little overwhelmed by Brian&#8217;s mountain of experience w/ memcached.</p>
<p>Anyway, the next thing I mentioned was that we had also tried MemcacheDB with some success. Brian wasn&#8217;t exactly impressed with MemcacheDB, and immediately suggested that we should be using <a href="http://tokyocabinet.sourceforge.net/tyrantdoc/">Tokyo Tyrant</a> instead. I had heard of Tokyo Cabinet, the new hotness in local key/value storage and retrieval, but what is this Tyrant you speak of?</p>
<p>I&#8217;ve been playing with Tokyo Tyrant ever since, and advocating for its usage at Adicio. It&#8217;s pretty impressive. In addition to speaking the memcached protocol, it apparently speaks HTTP/WebDAV too. The ability to select hash, btree, and a host of other options is nice, though I&#8217;m sure some of these are available as obscure options to berkeleydb as well.</p>
<p>Anyway, I was curious what performance was like, so I did some tests on my little Xen instance, and came up with pretty graphs.</p>
<p><a href="http://fewbar.com/wp-content/uploads/2009/06/tokyotyrantvsmemcachedb1.gif"><img src="http://fewbar.com/wp-content/uploads/2009/06/tokyotyrantvsmemcachedb1.gif" alt="tokyotyrantvsmemcachedb1" title="tokyotyrantvsmemcachedb1" width="465" height="472" class="alignnone size-full wp-image-92" /></a></p>
<p>I used the excellent <a href="http://code.google.com/p/brutis/">Brutis</a> tool to run these benchmarks using the most interesting platform for me at the moment.. which would be PHP with the PECL Memcache module.</p>
<p>These numbers were specifically focused on usage that is typical to MemcacheDB. A wide range of keys (in this case, 10000 is &#8220;wide&#8221; since the testing system is very small), not-small items (2k or so), and lower write:read ratio (1:50). I had the tests restart each daemon after each run, and these numbers are the results of the average of 3 runs each test.</p>
<p>I also tried these from another xen instance on the same LAN, and things got a lot slower. Not really sure why as latency is in the sub-millisecond range.. but maybe Xen&#8217;s networking just isn&#8217;t very fast. Either way, the numbers for each combination didn&#8217;t change much.</p>
<p>What I find interesting is that memcachedb in no-sync mode actually went faster than memcached. Of course, in no-sync mode, memcachedb is just throwing data at the disk. It doesn&#8217;t have to maintain LRU or slabs or anything.</p>
<p>Tokyo Tyrant was very consistent, and used *very* little RAM in all instances. I do recall reading that it compresses data. Maybe that&#8217;s a default? Anyway, Tokyo Tyrant also was the most CPU hungry of the bunch, so I have to assume having more cores might have resulted in much better results.</p>
<p>I&#8217;d like to get together a set of 3 or 4 machines to test multiple client threads, and replication as well. Will post that as part 2 when I pull it together. For now, it looks like.</p>
<p>In case anybody wants to repeat these tests, I&#8217;ve included <a href="http://spamaps.org/files/tt-mdb-memcache-tests.tgz">the results, and the scripts used to generate them in this tarball</a>.</p>
<p>&#8211; Additional info, 6/4/2009<br />
Another graph that some might find interesting, is this one detailing CPU usage. During all the tests, brutis used about 60% of the CPU available on the machine, so 40% is really 100%:</p>
<p><a href="http://fewbar.com/wp-content/uploads/2009/06/tokyotyranttests_cpu.gif"><img src="http://fewbar.com/wp-content/uploads/2009/06/tokyotyranttests_cpu.gif" alt="tokyotyranttests_cpu" title="tokyotyranttests_cpu" width="428" height="385" class="alignnone size-full wp-image-98" /></a></p>
<p>This tells me that the CPU was the limiting factor for Tokyo Tyrant, and with a multi-core machine, we should see huge speed improvements. Stay tuned for those tests!</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2009/06/tokyotyrant-memcachedb-but-without-the-bdb/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>MemcacheDB fault tolerance procedures</title>
		<link>http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/</link>
		<comments>http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 18:07:24 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[fault tolerance]]></category>
		<category><![CDATA[heartbeat]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[Scalability]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=46</guid>
		<description><![CDATA[It seemed so simple: just set up two memcachedb instances and point them at each other. Instant fault tolerance, right? If only it were so simple! It&#8217;s not entirely clear from the documentation how to set up memcachedb for fault tolerance. Here are the procedures I&#8217;ve found useful. Set up replication right. With all due respect to Steve Chu, [...]]]></description>
			<content:encoded><![CDATA[<p>It seemed so simple: just set up two memcachedb instances and point them at each other. Instant fault tolerance, right? If only it were so simple!</p>
<p>It&#8217;s not entirely clear from the documentation how to set up memcachedb for fault tolerance. Here are the procedures I&#8217;ve found useful.<br />
<span id="more-46"></span></p>
<ul>
<li><strong>Set up replication right</strong>. With all due respect to Steve Chu, the docs aren&#8217;t really clear on how to set up replication. It&#8217;s much simpler than it looks. Just run MemcacheDB as you would if it were standalone, but then add a combination of these 3 options:
<ul>
<li>You must have a -R line if you want to participate in replication. This is your hostname and port that listens for connections from other machines for replication. It is the same value that should be listed in every other machine&#8217;s -O.</li>
<li>A -O for *every* other machine that may want to replicate to/from this machine. I am sure there are situations where you won&#8217;t need these, but it makes re-syncing and elections more predictable. You won&#8217;t be able to re-sync &#8220;live&#8221; after a failure without -O options.</li>
<li>-M/-S are not required. If you start n machines without -M or -S, but with appropriate -R and -O lines, they will arbitrarily elect a master. If you run them with -M and -S, then the -M box will just be pushy and always elect itself the master, and the -S boxes will, likewise, always try to defer to slave status.</li>
<li>Let&#8217;s say we wanted to listen for the memcache protocol on port 45000 on host &#8216;node1&#8217; and replicate to &#8216;node2&#8217;:</li>
<li>Standalone: <code>memcachedb -p 45000 -H /home/memdb/data -u memdb -N</code></li>
<li>Replication w/ elected master: <code>memcachedb -p 45000 -H /home/memdb/data -u memdb -N -R node1:46000 -O node2:46000</code></li>
<li>Replication Master: <code>memcachedb -p 45000 -H /home/memdb/data -u memdb -N -R node1:46000 -O node2:46000 -M</code></li>
</ul>
</li>
<li><strong>Only the current master can accept writes</strong>. You can see which machine is the master with the &#8216;stats rep&#8217; command. In v1.2.1 it&#8217;s shown as an environment id. Below, st_env_id and st_master are the same, so this is the master:<br />
<code><br />
stats rep<br />
STAT st_bulk_fills 0<br />
STAT st_bulk_overflows 0<br />
STAT st_bulk_records 11<br />
STAT st_bulk_transfers 3<br />
STAT st_client_rerequests 0<br />
STAT st_client_svc_miss 0<br />
STAT st_client_svc_req 0<br />
STAT st_dupmasters 0<br />
STAT st_egen 3<br />
STAT st_election_cur_winner 2147483647<br />
STAT st_election_gen 0<br />
STAT st_election_lsn 1/28<br />
STAT st_election_nsites 0<br />
STAT st_election_nvotes 1<br />
STAT st_election_priority 100<br />
STAT st_election_sec 5<br />
STAT st_election_status 0<br />
STAT st_election_tiebreaker 3676766282<br />
STAT st_election_usec 69747<br />
STAT st_election_votes 0<br />
STAT st_elections 1<br />
STAT st_elections_won 1<br />
STAT st_env_id 2147483647<br />
STAT st_env_priority 100<br />
STAT st_gen 2<br />
STAT st_log_duplicated 0<br />
STAT st_log_queued 0<br />
STAT st_log_queued_max 0<br />
STAT st_log_queued_total 0<br />
STAT st_log_records 0<br />
STAT st_log_requested 0<br />
STAT st_master 2147483647<br />
STAT st_master_changes 0<br />
STAT st_max_lease_sec 0<br />
STAT st_max_lease_usec 0<br />
STAT st_max_perm_lsn 0/0<br />
STAT st_msgs_badgen 0<br />
STAT st_msgs_processed 5<br />
STAT st_msgs_recover 0<br />
STAT st_msgs_send_failures 2<br />
STAT st_msgs_sent 10<br />
STAT st_newsites 0<br />
STAT st_next_lsn 1/8916<br />
STAT st_next_pg 0<br />
STAT st_nsites 2<br />
STAT st_nthrottles 0<br />
STAT st_outdated 0<br />
STAT st_pg_duplicated 0<br />
STAT st_pg_records 0<br />
STAT st_pg_requested 0<br />
STAT st_startsync_delayed 0<br />
STAT st_startup_complete 0<br />
STAT st_status 2<br />
STAT st_txns_applied 0<br />
STAT st_waiting_lsn 0/0<br />
STAT st_waiting_pg 0<br />
END<br />
</code><br />
However, it&#8217;s much simpler, I think, to just try to store a value on an instance. If you get &#8220;STORED&#8221; back, then this is the master. If you get NOT_STORED back, this is a slave. If it blocks (timeouts are hard in simple scripts, I know.. perldoc -f alarm), you are in a &#8220;DOWN&#8221; state. The danger here is one of split brain, where both nodes think they&#8217;re the master.. but.. if they&#8217;re not talking, you have bigger problems!</li>
<li><strong>Out of sync slaves can&#8217;t READ either!</strong> This one bit us just the other day. Something occurred where our slave wasn&#8217;t able to retrieve the latest log entries from the master. Because of this, it was reporting errors in replication. During this time, *all* commands blocked. We were relying on basic round-robin DNS for failover, thinking that memcachedb was simple enough that it was either &#8220;up&#8221; or &#8220;down&#8221;. Unfortunately, it was stalled on one box, so everything that hit that box blocked and timed out until we firewalled the port so connections wouldn&#8217;t succeed. We eventually had to stop the instance, copy a db_hotbackup from the master, then start it again. This still had to catch up from the point at which the db_hotbackup copy&#8217;s logs were checkpointed, which was (because we&#8217;re on v1.0.3) many hours before. While it was catching up, all commands (even stats commands.. which is disappointing..) blocked.
</li>
<li><strong>Use a load balancer, not round robin</strong>. With that said, a load balancer is a far better solution than round robin. In this case, because the box was &#8220;up&#8221; but failing to respond, we were at the mercy of the pecl memcache module&#8217;s definition of what was &#8220;up&#8221; or &#8220;down&#8221; for reads. A load balancer separates this logic out into monitors so the code can just connect to a virtual IP, or use some list of servers it is given.</li>
<li><strong>Even better.. just use a floating IP</strong>. MemcacheDB seems to scale to ridiculous levels with reads. Like, 400:1 read:write performance. Do you really need lots of slaves? Just having an IP that follows the master will give you fault tolerance. It&#8217;s easy to determine if a box is the master. You can even do a &#8216;rep_set_priority 500&#8217; to make sure a box stays the master as long as it has the IP. If you&#8217;re running on Linux, good old <a href="http://www.linux-ha.org/">Heartbeat</a> is perfect for this. If you need to scale past the write capabilities of one box, then partitioning by using a stable hash algorithm on the keys is a far better solution than master/slave replication, and is already built in to pretty much every memcache client.</li>
<li><strong>Be careful with db_archive/db_checkpoint</strong>. This is mostly regarding v1.0.3, as I don&#8217;t know the impact of these commands on v1.1 or 1.2. However, it would seem that even with a replication policy of &#8220;ACK_ONE&#8221;, it&#8217;s still possible to purge logs that the slave needs. This may or may not be true (something else could have gone wrong), but running db_checkpoint/db_archive too aggressively seems to have broken our replication. There&#8217;s no reason to purge logs too often, so be wary when doing so.</li>
</ul>
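<p>The &#8220;just try to store a value&#8221; check from the list above could be scripted roughly like this. It is only a sketch: the function names, the probe key, and the 2 second timeout are my own arbitrary choices, and it speaks the plain memcache text protocol over a socket rather than using any client library.</p>

```php
<?php
// Map the reply to a "set" onto the three states described above.
// $reply is the first line read from the socket, or null on timeout/failure.
function classify_node(?string $reply): string
{
    if ($reply === null) {
        return 'down';
    }
    $reply = trim($reply);
    if ($reply === 'STORED') {
        return 'master';
    }
    if ($reply === 'NOT_STORED') {
        return 'slave';
    }
    return 'unknown';
}

// Send a tiny "set" to one instance and classify the answer.
// The probe key and the timeout are arbitrary, hypothetical choices.
function probe_node(string $host, int $port, int $timeout = 2): string
{
    $sock = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($sock === false) {
        return classify_node(null);
    }
    stream_set_timeout($sock, $timeout);
    // memcache text protocol: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    fwrite($sock, "set __master_probe 0 0 1\r\n1\r\n");
    $line = fgets($sock);
    $meta = stream_get_meta_data($sock);
    fclose($sock);
    if ($line === false || $meta['timed_out']) {
        return classify_node(null); // blocked or closed: treat as DOWN
    }
    return classify_node($line);
}
```

<p>Something like <code>probe_node('node1', 45000)</code> would then return &#8216;master&#8217;, &#8216;slave&#8217;, or &#8216;down&#8217;. Keep in mind that the probe really does write a key on the master, so pick a key name that nothing else uses.</p>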
<p>Hopefully this will help other users who are starting to set up MemcacheDB and need fault tolerance.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Memcached and Mogile Form MemcacheMegaZord!</title>
		<link>http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/</link>
		<comments>http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/#comments</comments>
		<pubDate>Sun, 14 Dec 2008 17:21:50 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[sessions]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=27</guid>
		<description><![CDATA[So I was starting to play with Memcached for session storage, and I found a fairly big problem with just using memcached in its normal caching mode as a session store. It really just boils down to caching and storing of deterministic data being very different things that only look similar on the surface. So normally, [...]]]></description>
			<content:encoded><![CDATA[<p>So I was starting to play with Memcached for session storage, and I found a fairly big problem with just using memcached in its normal caching mode as a session store. It really just boils down to caching and storing of deterministic data being very different things that only look similar on the surface.<br />
<span id="more-27"></span><br />
So normally, memcached is used in a very clever way by adding a list of servers, and then using a hashing algorithm to pick a server to actually contact based on the key of a get/set request. This allows a ton of scaling out, with minimal moving parts. There&#8217;s no periodic monitor or broadcast protocol to add and remove cluster members to and from pools, so you can just run memcached on a bunch of servers, and use a consistent list across all of your machines to achieve a huge degree of scale out. When a server dies, the code just sees that, and moves on to the next one in the hash algorithm, and all is well.</p>
<p>For caching, this &#8220;failover&#8221; methodology works fine. If I go to set a value in memcached, and the client fails over to the second server, that&#8217;s OK. The next get to the primary will miss, the value will get set properly, and the old entry on the secondary will *eventually* get pushed out of the cache.</p>
<p>However, for storing data reliably, this becomes a problem. Let&#8217;s say there is a scenario where a network cable is bad on one of the memcached servers. 1 in 100 requests fails. With caching, failover will go a little nuts, but it&#8217;s entirely possible nobody will even notice, as results will be cached, data won&#8217;t get stale.. no big deal.</p>
<p>With storage though, this could happen..</p>
<p>- session is created on memcache1</p>
<p>- session tries to read from memcache1, and fails.. so new session is created on memcache2</p>
<p>- session is then read from memcache1</p>
<p>- session is updated on memcache1 with new information</p>
<p>- session fails to read from memcache1, and old session data is read from memcache2, then the set succeeds on memcache1, and the newer data on memcache1 is lost.</p>
<p>The point isn&#8217;t really this scenario&#8217;s details, but that this hashing algorithm is vulnerable, even designed to lose data that was written to it. That is the caching paradigm.</p>
<p>As I discussed this with some colleagues, my mind immediately jumped to <a href="http://www.memcachedb.org">MemcacheDB</a>. Maybe that would work for session storage. It has replication, so we could use the traditional active/passive paradigm for it. However, this limits our scale to whatever a single instance of MemcacheDB can handle. Honestly, that&#8217;s probably fine for most sites, as MemcacheDB can probably handle tens of thousands of small writes per second.</p>
<p>However, there are multiple problems. The biggest problem with MemcacheDB is there&#8217;s no easy way (yet, they&#8217;re working on it) to pull keys out of it to do garbage collection. Likewise, session data really doesn&#8217;t need to live for a long time. We just need to be reasonably certain that the data we&#8217;re getting is reasonably new.</p>
<p>If we store the data in *all* of the servers, and if we store a highly accurate (meaning if it takes you milliseconds to complete a request, this timestamp needs to be down to microseconds) timestamp of when the data was given to us (meaning we use the same timestamp for each server) alongside it, we can then just read it from all of the servers, and pick the newest one. Ew, that means we are still limited to the scale of one instance of memcached.</p>
<p>Then I had a flashback to the way <a href="http://danga.com/mogilefs/">MogileFS</a> works. It stores data on a number of replica servers. Of course, it also keeps track of where it stored them. But I figured, for sessions, that&#8217;s a lot of overhead. There&#8217;s an easier way. We can use the <a href="http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/">consistent hashing algorithm</a> that the PHP Memcache module uses to pick servers, and just read and write the data from nReplicas servers. If a server fails, we&#8217;ll move on to the next one, and there&#8217;s a reasonable degree of certainty that it will remain the same. If we write stale data to a server and then fail back to it later, we&#8217;re protected by the timestamp rules. The higher nReplicas is, the higher the likelihood that a server failure won&#8217;t cause issues. I even found <a href="http://paul.annesley.cc/articles/2008/04/30/flexihash-consistent-hashing-php">a PHP implementation of consistent hashing called FlexiHash</a>.</p>
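<p>Here is a minimal sketch of that idea, under some loud assumptions: plain PHP arrays stand in for the memcached servers, the server picker is a simplified crc32 walk rather than the real consistent hashing the Memcache module uses, and every function name is made up for illustration.</p>

```php
<?php
// Pick $nReplicas distinct servers for a key. A simplified stand-in for
// consistent hashing: start at crc32(key) mod N and walk forward.
function pick_servers(string $key, int $serverCount, int $nReplicas): array
{
    $start = (crc32($key) & 0x7fffffff) % $serverCount;
    $picked = [];
    for ($i = 0; $i < min($nReplicas, $serverCount); $i++) {
        $picked[] = ($start + $i) % $serverCount;
    }
    return $picked;
}

// Write the same value and the same timestamp to every replica.
// $servers is an array of arrays standing in for memcached instances;
// $down lists server indexes that are unreachable for this write.
function replicated_set(array &$servers, string $key, $value, float $ts, int $nReplicas, array $down = []): void
{
    foreach (pick_servers($key, count($servers), $nReplicas) as $idx) {
        if (in_array($idx, $down, true)) {
            continue; // this replica missed the write
        }
        $servers[$idx][$key] = ['ts' => $ts, 'value' => $value];
    }
}

// Read from every replica and keep the copy with the newest timestamp.
function replicated_get(array $servers, string $key, int $nReplicas)
{
    $newest = null;
    foreach (pick_servers($key, count($servers), $nReplicas) as $idx) {
        $copy = $servers[$idx][$key] ?? null;
        if ($copy !== null && ($newest === null || $copy['ts'] > $newest['ts'])) {
            $newest = $copy;
        }
    }
    return $newest === null ? null : $newest['value'];
}

$servers = [[], [], [], []];
$replicas = pick_servers('sess:abc', 4, 2);
replicated_set($servers, 'sess:abc', 'cart=1', 100.000001, 2);
// The second write misses the first replica, leaving a stale copy there.
replicated_set($servers, 'sess:abc', 'cart=2', 100.000002, 2, [$replicas[0]]);
$value = replicated_get($servers, 'sess:abc', 2);
```

<p>In real code the timestamp would come from something like microtime(true), taken once per write and reused for every replica, so that the copies can be compared on read.</p>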
<p>There&#8217;s one last issue that bugs me about using memcached for sessioning, and that the timestamp helps us solve. We recently found that there was a problem where a request would take, say, 45 seconds to complete. At 20 seconds, the user would hit the back button out of frustration. This would put other stuff in the session, then the 45 second request would complete, and write the version of the session it thinks is right to the session store, losing the user&#8217;s new activity.</p>
<p>There are two ways to solve this. One is to introduce locking. This actually isn&#8217;t hard to do with Memcached; it is <a href="http://www.socialtext.net/memcached/index.cgi?faq#emulating_locking_with_the_add_command">described in the memcached faq</a>. However, this introduces something to block or fail on in the read. I think it&#8217;s simpler than that. You simply read the record before you write it, and if it has changed since you read it the first time, you don&#8217;t write it. You just throw the session write away. Obviously the user has moved on, so there&#8217;s no reason to make your update. If you used locking, the user would still be waiting on the old thread to finish.</p>
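<p>A rough sketch of that read-before-write rule, again with a plain array standing in for the session store and with hypothetical function names. The timestamp stored with the session is what tells us whether somebody else wrote in the meantime.</p>

```php
<?php
// Load a session and the timestamp it carried when we read it.
// $store is a plain array standing in for the session store.
function session_load(array $store, string $id): array
{
    return $store[$id] ?? ['ts' => 0.0, 'data' => []];
}

// Write the session back only if nobody else has written since our read.
// Returns true if the write happened, false if it was thrown away.
function session_save(array &$store, string $id, array $data, float $readTs, float $now): bool
{
    $current = $store[$id] ?? ['ts' => 0.0, 'data' => []];
    if ($current['ts'] !== $readTs) {
        return false; // the user has moved on; drop this stale write
    }
    $store[$id] = ['ts' => $now, 'data' => $data];
    return true;
}

$store = [];
session_save($store, 'abc', ['page' => 'home'], 0.0, 1.0);

// A slow request and a fast request both read the session at ts 1.0.
$slow = session_load($store, 'abc');
$fast = session_load($store, 'abc');

// The fast request (the user hit "back") finishes first and wins.
$fastSaved = session_save($store, 'abc', ['page' => 'search'], $fast['ts'], 2.0);
// The slow request finishes later; its write is discarded.
$slowSaved = session_save($store, 'abc', ['page' => 'report'], $slow['ts'], 3.0);
```

<p>Against a real memcached the re-read and the write are two separate round trips, so there is still a small window between them. The point is only to shrink the 45 second window down to that.</p>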
<p>Of course, this all hinges on you caring that your session data is accurate, and that users don&#8217;t lose their sessions when one server goes down. If neither of those applies to you, then you can just use sessions like cache.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using memcachedb and memcached to make things scale</title>
		<link>http://fewbar.com/2008/06/using-memcachedb-and-memcached-make-things-scale/</link>
		<comments>http://fewbar.com/2008/06/using-memcachedb-and-memcached-make-things-scale/#comments</comments>
		<pubDate>Thu, 26 Jun 2008 05:40:01 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Scalability]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[sclability]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=9</guid>
		<description><![CDATA[I don&#8217;t remember exactly how I found memcachedb, however, it is one of those projects that somebody else beat me to the punch in writing. I mean, it was going to happen, as the need was there. Steve Chu, the author, did a great job of melding two open source projects, BerkeleyDB, and memcached, to [...]]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t remember exactly how I found <a href="http://www.memcachedb.org">memcachedb</a>, however, it is one of those projects that somebody else beat me to the punch in writing. I mean, it was going to happen, as the need was there. Steve Chu, the author, did a great job of melding two open source projects, <a href="http://www.oracle.com/database/berkeley-db/index.html">BerkeleyDB</a>, and <a href="http://www.danga.com/memcached/">memcached</a>, to produce something really very powerful.<br />
<span id="more-9"></span><br />
Now, memcached has become almost completely ubiquitous in scaling web apps. Memcached is essentially a network-enabled, non-persistent data store. It is generally used as what I&#8217;ll call a <a href="http://en.wikipedia.org/wiki/Cache">write-back data cache</a> (the pattern is more commonly called cache-aside or look-aside), meaning that you look in the faster cache; if nothing is there, you look in the slower place, then write the value back to the faster cache. Some industrious people have used it for session storage, and I&#8217;m sure there are a few other clever uses.</p>
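<p>A minimal sketch of that lookup flow, in Python. The <code>DictCache</code> class is a hypothetical in-memory stand-in for a real memcached client, and <code>fetch</code> and <code>load_from_slow_store</code> are names made up for illustration; treat it as the shape of the idea, not as any particular client library&#8217;s API.</p>

```python
import time


class DictCache:
    """In-memory stand-in for a memcached client (hypothetical, illustration only)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            # Entry has outlived its TTL, so treat it as a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (value, time.time() + ttl)


def fetch(cache, key, load_from_slow_store, ttl=300):
    """Look in the fast cache first; on a miss, load from the slow place
    and write the value back to the cache for the next caller."""
    value = cache.get(key)
    if value is None:  # note: a stored None is indistinguishable from a miss
        value = load_from_slow_store(key)
        cache.set(key, value, ttl)
    return value
```

<p>The first <code>fetch</code> for a key pays the cost of the slow store; later calls within the TTL are served from the cache.</p>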
<p>One of my favorite parts of memcached is how dead simple it is. The protocol is very easy to read, which makes debugging issues and writing new clients straightforward. It uses the &#8220;least recently used&#8221; algorithm to move things out of the cache when it starts to fill up, so it&#8217;s extremely easy to understand how the whole thing works.</p>
<p>The cleverest part of using memcached has nothing to do with the service itself, but with the client API. The <a href="http://www.danga.com/">smart guys</a> who developed it figured out that they could hash the key, and pick the same server for reads/writes every time as long as the number of servers doesn&#8217;t change. This allows it to scale out to a ridiculous size while retaining its simplicity and performance.</p>
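<p>Roughly, the client-side trick looks like the sketch below. The choice of CRC32 and plain modulo is my assumption for illustration (real clients vary in hash function and distribution strategy), and the server addresses are made up.</p>

```python
import zlib


def pick_server(key, servers):
    """Map a key to one server using a stable hash modulo the pool size.

    The same key always lands on the same server as long as the
    server list (and its order) does not change.
    """
    index = zlib.crc32(key.encode("utf-8")) % len(servers)
    return servers[index]


# Hypothetical pool of memcached servers.
servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
chosen = pick_server("settings:42:theme", servers)
```

<p>Because the index depends on <code>len(servers)</code>, adding or removing a server remaps most keys, which is why the scheme only holds while the pool size stays fixed.</p>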
<p>Two problems arise when a site uses any caching, be it memcached or aggressive HTTP headers.</p>
<p>First, the site starts to rely on caching too heavily for performance. As an example, I had a situation where the entire corpus of settings for each client site (hundreds of clients, hundreds of settings) was kept in memcached as one massive 200kB+ serialized PHP object. Every page view that needed to access any settings would grab this object at the beginning of the code, and use the object throughout.</p>
<p>This worked really great in some instances, as most of the biggest pages needed to access 30 &#8211; 50 settings each time. However, the trouble would come when there was a page that would get a high degree of concurrency, such as an iframe that gets displayed on every page of a major website, or on a page that gets slashdotted. It would be blazing fast, generating almost no load at all for a while, but whenever a setting would be changed (the settings application would clear the cache of settings for whichever client was edited), or the cache object would expire, the database would spike out of control.</p>
<p>The reason was that this object took about 1-3 seconds to fetch from the database. Well, with 1000 requests per second, that&#8217;s up to 3000 requests that miss the cache and so ask the database for the information. The solution was to cache each setting individually, and use a random skew on the expire time. This prevented the storm of requests whenever there was an expire, and it kept items that were looked up in rapid succession from all expiring at once.</p>
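<p>Here is a small sketch of the random skew. The 20% spread, the key layout, and the function names are assumptions for illustration, not what the original code used.</p>

```python
import random


def skewed_ttl(base_ttl, skew_fraction=0.2, rng=random):
    """Return base_ttl plus or minus up to skew_fraction, in whole seconds.

    Spreading expiry times out means entries cached in the same instant
    do not all expire (and hit the database) in the same instant.
    """
    spread = base_ttl * skew_fraction
    return max(1, int(round(base_ttl + rng.uniform(-spread, spread))))


def setting_key(client_id, setting_name):
    """One cache key per setting, instead of one giant object per client."""
    return "settings:%s:%s" % (client_id, setting_name)
```

<p>Each setting would then be stored under <code>setting_key(client_id, name)</code> with <code>skewed_ttl(600)</code> as its expire time, so a 10 minute base TTL lands somewhere between 8 and 12 minutes.</p>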
<p>This brings us to the second problem with caching, and specifically memcached. The cache is sometimes mistaken for a data store. In the above example, by clearing out entries from memcached, the caching was essentially neutered. Any time during the day somebody might come along and blow out the cache. That&#8217;s fine with MySQL&#8217;s query cache, for instance, because that just makes queries come back faster. The connection is already made, so one of the most painful parts has already happened. With memcached, however, the cache can scale to many thousands of connections very cheaply, whereas doing this with most databases is expensive, if not impossible.</p>
<p>So to combat this, what is really needed is a persistent place to keep your data up to date when it is needed at an extremely high read-to-write ratio. That&#8217;s where memcachedb is so attractive. Instead of keeping everything in RAM, memcachedb stores anything you put into it in a BerkeleyDB database. To boot, it can replicate this data to another machine, adding to its reliability and availability. This means that writes will be slower, and it won&#8217;t scale out nearly as cheaply, but that&#8217;s OK for situations like this.</p>
<p>With memcachedb, we can change the setting management program to save the data into the database <strong>and </strong>memcachedb, confident in the fact that it will be there later. Then we don&#8217;t have write-back caching code in our application, we just remove the part that connects to the database for that data at all.</p>
<p>This has a huge benefit beyond just performance. With this scheme, we can write simple applications that won&#8217;t rely on the read/write database server ever being up. It also means that we don&#8217;t have to have a giant database server, or a huge replication fanout to get this data available in realtime.</p>
<p>There is of course the danger that memcachedb gets out of sync with the main db. That&#8217;s why, in addition to writing to memcachedb whenever you write to the database server, you can also run a refresh script periodically that grabs all of the data from the database and walks through it, writing items to memcachedb. Care must be taken here to make sure one doesn&#8217;t write stale data to memcachedb. The safest way is to include a timestamp with each record that can easily be compared. Another way to go is to just have this script alert you to items that are out of sync, requiring manually re-saving these records.</p>
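<p>The timestamp comparison in that refresh script could look something like this sketch. A plain dict stands in for memcachedb here, and the <code>(key, value, updated_at)</code> row shape is an assumption; a real script would go through a memcache client and would still need to think about writes racing between the read and the write.</p>

```python
def refresh(db_rows, store):
    """Copy rows into store unless store already holds an equal or newer copy.

    db_rows: iterable of (key, value, updated_at) read from the main database.
    store:   dict-like mapping key to (value, updated_at); stands in for memcachedb.
    Returns the list of keys that were actually written.
    """
    written = []
    for key, value, updated_at in db_rows:
        current = store.get(key)
        if current is not None and current[1] >= updated_at:
            # The stored copy is as new or newer, so writing would be stale.
            continue
        store[key] = (value, updated_at)
        written.append(key)
    return written
```

<p>The returned list doubles as the &#8220;alert me&#8221; variant: instead of writing, the script could simply report which keys were out of sync.</p>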
<p>Memcachedb is, unfortunately, still a little raw. The replication setup is rather complex. It took me a little while to get it working the way I wanted with just two boxes. It definitely could use command line options to set replication options, so that slaves don&#8217;t accidentally promote themselves to masters. Right now one can only do that through the protocol, so I have a nagios plugin that checks it and changes it if it is wrong.</p>
<p>I think it&#8217;s important to note just how cool it is that 90% of memcachedb was written before it was conceived of. <a href="http://www.oracle.com/database/berkeley-db/index.html">BerkeleyDB</a> is one of the great open source success stories, having a successful business model built on free code, and eventually attracting enough attention from Oracle to get purchased. Then to merge that with memcached, which is one of those projects that makes you wish you had written it first, well, I think that&#8217;s a stroke of genius. Good job, Mr. Chu.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2008/06/using-memcachedb-and-memcached-make-things-scale/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>


