<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>FewBar.com - Make it good &#187; Scalability</title>
	<atom:link href="http://fewbar.com/tag/scalability/feed/" rel="self" type="application/rss+xml" />
	<link>http://fewbar.com</link>
	<description>Technology, life, and mischief, not in that order</description>
	<lastBuildDate>Fri, 23 Dec 2011 01:41:49 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.5</generator>
		<item>
		<title>So what is Ensemble anyway?</title>
		<link>http://fewbar.com/2011/06/so-what-is-ensemble-anyway/</link>
		<comments>http://fewbar.com/2011/06/so-what-is-ensemble-anyway/#comments</comments>
		<pubDate>Fri, 03 Jun 2011 18:53:08 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Ensemble]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[chef]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[config management]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[ensemble]]></category>
		<category><![CDATA[mediawiki]]></category>
		<category><![CDATA[orchestration]]></category>
		<category><![CDATA[principia]]></category>
		<category><![CDATA[puppet]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=414</guid>
		<description><![CDATA[Have you heard of Ensemble? Are you excited about Cloud/Service Orchestration? What? Ok you&#8217;re not alone if you are scratching your head. Ensemble is an implementation of a new idea that has been taking shape the last couple of years. Ever since Amazon hooked up a remote API to thousands of machines to provide access [...]]]></description>
			<content:encoded><![CDATA[<p>Have you heard of <a href="http://ensemble.ubuntu.com/">Ensemble</a>? Are you excited about Cloud/Service Orchestration? What? Ok you&#8217;re not alone if you are scratching your head.</p>
<p>Ensemble is an implementation of a new idea that has been taking shape the last couple of years. Ever since Amazon hooked up a remote API to thousands of machines to provide access to their virtual infrastructure (and called it macaroni? err.. AWS), people have been dreaming up ways to take advantage of what is basically a robotic &#8220;NOC guy&#8221;. No longer do you have to pre-rack servers or call your vendor frantically to get servers sent next-day to your colo. Right?</p>
<p>Naturally, the system administrators who would normally be in charge of racking servers applied their existing tools to the job, with mixed success. Config management is really good at modelling identical hosts. But with virtual hosts instantly available, this left those thinking at a higher level wanting more. Chef in particular implemented a nice set of tools and functionality to allow this high-level &#8220;service&#8221; definition with its knife tools and simple Ruby API.</p>
<p>But how easy are Chef&#8217;s cookbooks to share and use without modification?<span id="more-414"></span> How easy are they to integrate? Puppet has modules that are capable of similar functionality, and the recent integration of MCollective, plus Puppet Faces, has certainly added a lot of the same things Chef had to support this kind of application modelling. But again, the modules seem to require a lot of convention, assumption, and tweaking before they become useful.</p>
<p>It&#8217;s my opinion that this is very much like the way tarballs+autoconf became the de-facto standard for distributing free software. It was *so much* better than writing a Makefile by hand, and it achieved an enormous amount of portability, so developers adopted it rapidly. In fact, it is still the dominant way to distribute portable open source applications.</p>
<p>But at some point, the limitations of this became clear. There was a need for something more concise that could distribute both the source and binaries built for a platform. There was some limited <a href="http://slackware.org">early success with tarballs</a> built by convention. But then, enter RPM and dpkg. These included ways to express facts about software, like its dependencies, architecture, and the revisions made to it to work on the target platform. This allowed distributors of software to more easily maintain their systems, and enabled users to manage the software in their environments.</p>
<p>At that point, some smart guy figured out that we should be able to download and automatically configure all of the software needed for one application to work properly, just from its packaging information. To my mind, apt-get was my first experience with this, though FreeBSD ports authors may disagree there. Either way, this made it very easy for admins and users to install software without spending hours in the 7 levels of dependency hell.</p>
<p>In many ways, Service Orchestration is a way of bringing the benefits of packaging to the cloud. It should allow us to build out our cloud in a sane way, taking advantage of the knowledge that has been gained by others. For the bits that we need to finely tune, it should step aside and allow that without compromising the system.</p>
<p>Ensemble is an implementation of this idea, and <a href="https://launchpad.net/principia">Principia</a> is a collection of &#8220;<a href="https://ensemble.ubuntu.com/docs/formula.html">Formulas</a>&#8221; for Ensemble. They are tightly coupled to Ubuntu, as they are in many ways meant to be the dpkg and apt-get for Ubuntu in the cloud.</p>
<p>It&#8217;s pretty easy to try out Ensemble and Principia on Ubuntu. Right now you&#8217;ll need an EC2 account with an access key set up, though we&#8217;re working on making this work with just your local machine for rapid development.</p>
<p><em><strong>It&#8217;s been pointed out to me that the version of principia-tools that was available at the time of this writing didn&#8217;t include /usr/share/principia-tools/tests. I&#8217;ve uploaded a fixed version to the ensemble PPA, so if you tried these instructions and failed, please try updating principia-tools. If that fails, you can get the tests with bzr branch lp:principia-tools.</strong><br />
</em></p>
<p><code><br />
sudo add-apt-repository ppa:ensemble/ppa<br />
sudo apt-get update<br />
sudo apt-get install principia-tools<br />
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxx<br />
export AWS_ACCESS_KEY_ID=0123456789ABCDEF<br />
ensemble bootstrap<br />
principia getall /some/path/for/formulas<br />
/usr/share/principia-tools/tests/mediawiki.sh /some/path/for/formulas<br />
</code></p>
<p>What does this give you? Well, it should give you a 7 node mediawiki cluster of t1.micros in the us-east-1 region of EC2. I just ran it and now I have this:</p>
<pre>machines:
  0: {dns-name: ec2-50-19-158-109.compute-1.amazonaws.com, instance-id: i-215dd84f}
  1: {dns-name: ec2-50-17-16-228.compute-1.amazonaws.com, instance-id: i-8d58dde3}
  2: {dns-name: ec2-72-44-49-114.compute-1.amazonaws.com, instance-id: i-9558ddfb}
  3: {dns-name: ec2-50-19-47-106.compute-1.amazonaws.com, instance-id: i-6d5bde03}
  4: {dns-name: ec2-174-129-132-248.compute-1.amazonaws.com, instance-id: i-7f5bde11}
  5: {dns-name: ec2-50-19-152-136.compute-1.amazonaws.com, instance-id: i-755bde1b}
  6: {dns-name: '', instance-id: i-4b5bde25}
services:
  demo-wiki:
    formula: local:mediawiki-62
    relations: {cache: wiki-cache, db: wiki-db, website: wiki-balancer}
    units:
      demo-wiki/0:
        machine: 2
        relations: {}
        state: null
      demo-wiki/1:
        machine: 6
        relations: {}
        state: null
  wiki-balancer:
    formula: local:haproxy-13
    relations: {reverseproxy: demo-wiki}
    units:
      wiki-balancer/0:
        machine: 4
        relations: {}
        state: null
  wiki-cache:
    formula: local:memcached-10
    relations: {cache: demo-wiki}
    units:
      wiki-cache/0:
        machine: 3
        relations: {}
        state: null
      wiki-cache/1:
        machine: 5
        relations: {}
        state: null
  wiki-db:
    formula: local:mysql-93
    relations: {db: demo-wiki}
    units:
      wiki-db/0:
        machine: 1
        relations: {}
        state: null</pre>
<p>At the top you see the machines that ensemble spun up in EC2 in the &#8216;machines&#8217; section. The numbers there correspond to the &#8216;machine: #&#8217; in the service/units definitions below. If you look through, you&#8217;ll see above that wiki-balancer is machine 4, which has a hostname of ec2-174-129-132-248.compute-1.amazonaws.com. If you go to that hostname, once all relations are up (I like to use &#8216;watch ensemble status&#8217; to see when this happens), you should see a working mediawiki. But not just a working mediawiki, a scalable one. If you want to pour on the traffic, spin up 3 more demo-wiki&#8217;s to handle the app server load:</p>
<p><code><br />
ensemble add-unit demo-wiki<br />
ensemble add-unit demo-wiki<br />
ensemble add-unit demo-wiki<br />
</code></p>
<p>These will of course take a minute or two to spin up. Once they&#8217;re ready they&#8217;ll show up in the status output:</p>
<pre>services:
  demo-wiki:
    formula: local:mediawiki-62
    relations: {cache: wiki-cache, db: wiki-db, website: wiki-balancer}
    units:
      demo-wiki/0:
        machine: 2
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/1:
        machine: 6
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/2:
        machine: 7
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/3:
        machine: 8
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/4:
        machine: 9
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
</pre>
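<p>The cross-referencing described above (unit, then machine number, then hostname) and the &#8220;are all relations up yet?&#8221; check are easy to script. Here is a minimal Python sketch. The <code>status</code> dict is a hand-copied subset that mixes the two status outputs in this post, and the helper names <code>unit_hostname</code> and <code>service_ready</code> are made up for illustration; they are not part of Ensemble:</p>
<pre>
```python
# Hand-copied subset of the status outputs shown in this post. A real script
# would parse the YAML printed by 'ensemble status' instead.
status = {
    "machines": {
        2: {"dns-name": "ec2-72-44-49-114.compute-1.amazonaws.com"},
        4: {"dns-name": "ec2-174-129-132-248.compute-1.amazonaws.com"},
    },
    "services": {
        "demo-wiki": {
            "relations": {"cache": "wiki-cache", "db": "wiki-db",
                          "website": "wiki-balancer"},
            "units": {
                "demo-wiki/0": {
                    "machine": 2,
                    "relations": {"cache": {"state": "up"},
                                  "db": {"state": "up"},
                                  "website": {"state": "up"}},
                    "state": "started",
                },
            },
        },
        "wiki-balancer": {
            "relations": {"reverseproxy": "demo-wiki"},
            "units": {
                "wiki-balancer/0": {"machine": 4, "relations": {},
                                    "state": None},
            },
        },
    },
}


def unit_hostname(status, unit_name):
    """Follow unit -> machine number -> dns-name (hypothetical helper)."""
    service_name = unit_name.split("/")[0]
    unit = status["services"][service_name]["units"][unit_name]
    return status["machines"][unit["machine"]]["dns-name"]


def service_ready(status, service_name):
    """True when every unit is started and all of its relations are up."""
    service = status["services"][service_name]
    expected = set(service["relations"])
    for unit in service["units"].values():
        if unit["state"] != "started":
            return False
        up = {name for name, rel in unit["relations"].items()
              if rel.get("state") == "up"}
        if up != expected:
            return False
    return True
```
</pre>
<p>With this data, <code>unit_hostname(status, "wiki-balancer/0")</code> gives the load balancer hostname, and <code>service_ready</code> is true for demo-wiki but false for wiki-balancer, whose unit was copied from the earlier output where its state was still null.</p>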
<p>How about a little test then? After I got to this point, I logged in as WikiSysop (change the password, folks! It&#8217;s change-me) and imported the Wikipedia exports for &#8220;Ubuntu&#8221; and &#8220;EC2&#8221;. After that I used harvestman to spider the site and then saved all the URLs in a file, urls.txt. Alright! Now let&#8217;s fire up *siege* from a machine outside the cluster, but in the same availability zone / security group (so at least we&#8217;re only dealing with EC2&#8217;s latency and not my net connection), and see if we can take this cluster down!</p>
<p><code><br />
$ siege -i -c 5 -f urls.txt<br />
...<br />
Transactions:		         563 hits<br />
Availability:		      100.00 %<br />
Elapsed time:		       95.58 secs<br />
Data transferred:	        2.64 MB<br />
Response time:		        0.35 secs<br />
Transaction rate:	        5.89 trans/sec<br />
Throughput:		        0.03 MB/sec<br />
Concurrency:		        2.04<br />
Successful transactions:         544<br />
Failed transactions:	           0<br />
Longest transaction:	       13.54<br />
Shortest transaction:	        0.00<br />
</code></p>
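<p>A quick sanity check on how these siege numbers relate to each other: the transaction rate is just hits divided by elapsed time, and the average concurrency is roughly the rate multiplied by the response time. The Python sketch below recomputes both from the run above; the function name <code>siege_summary</code> is only illustrative, and the concurrency figure is an approximation because siege prints a rounded response time:</p>
<pre>
```python
def siege_summary(transactions, elapsed_secs, response_secs):
    """Recompute rate and approximate concurrency from a siege report."""
    rate = transactions / elapsed_secs      # transactions per second
    concurrency = rate * response_secs      # average requests in flight
    return round(rate, 2), round(concurrency, 2)


# Numbers taken from the t1.micro run above.
rate, concurrency = siege_summary(563, 95.58, 0.35)
print(rate, concurrency)
```
</pre>
<p>The rate matches the reported 5.89 trans/sec, and the concurrency estimate lands close to the reported 2.04.</p>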
<p>This is, btw, the best run I got out of t1.micro&#8217;s. Sometimes it would get quite ugly:</p>
<p><code><br />
Transactions:		         892 hits<br />
Availability:		       99.55 %<br />
Elapsed time:		      221.69 secs<br />
Data transferred:	        3.64 MB<br />
Response time:		        0.61 secs<br />
Transaction rate:	        4.02 trans/sec<br />
Throughput:		        0.02 MB/sec<br />
Concurrency:		        2.45<br />
Successful transactions:         849<br />
Failed transactions:	           4<br />
Longest transaction:	       27.41<br />
Shortest transaction:	        0.00<br />
</code></p>
<p>Let&#8217;s try the whole thing over with m1.small. First I edit ~/.ensemble/environments.yaml and add an override for the default-instance-type:</p>
<pre>
ensemble: environments

environments:
  sample:
    type: ec2
    default-instance-type: m1.small
    control-bucket: ensemble-12345678901234567890
    admin-secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
</pre>
<p>Then I re-run the whole test:</p>
<p><code><br />
Transactions:		         290 hits<br />
Availability:		       98.98 %<br />
Elapsed time:		       81.79 secs<br />
Data transferred:	        0.78 MB<br />
Response time:		        0.53 secs<br />
Transaction rate:	        3.55 trans/sec<br />
Throughput:		        0.01 MB/sec<br />
Concurrency:		        1.89<br />
Successful transactions:         277<br />
Failed transactions:	           3<br />
Longest transaction:	        1.50<br />
Shortest transaction:	        0.00<br />
</code></p>
<p>Oops! I forgot to add my 3 extra nodes. Note that these two m1.smalls are already almost keeping up. Now as I add these, I keep siege running. It&#8217;s pretty cool to watch the response times drop as nodes come online to carry some of the load.</p>
<p>Now with 5 m1.small&#8217;s:</p>
<p><code><br />
Transactions:		         273 hits<br />
Availability:		      100.00 %<br />
Elapsed time:		       54.27 secs<br />
Data transferred:	        0.99 MB<br />
Response time:		        0.47 secs<br />
Transaction rate:	        5.03 trans/sec<br />
Throughput:		        0.02 MB/sec<br />
Concurrency:		        2.38<br />
Successful transactions:         260<br />
Failed transactions:	           0<br />
Longest transaction:	       19.92<br />
Shortest transaction:	        0.00<br />
</code></p>
<p>And with concurrency raised from 5 to 10:</p>
<p><code><br />
Transactions:		         327 hits<br />
Availability:		      100.00 %<br />
Elapsed time:		       42.20 secs<br />
Data transferred:	        1.30 MB<br />
Response time:		        0.66 secs<br />
Transaction rate:	        7.75 trans/sec<br />
Throughput:		        0.03 MB/sec<br />
Concurrency:		        5.12<br />
Successful transactions:         318<br />
Failed transactions:	           0<br />
Longest transaction:	       25.51<br />
Shortest transaction:	        0.00<br />
</code></p>
<p>And now if we add 2 more, for a total of 7 nodes, concurrency of 10 gets even better:</p>
<p><code><br />
Transactions:		         531 hits<br />
Availability:		      100.00 %<br />
Elapsed time:		       53.37 secs<br />
Data transferred:	        1.75 MB<br />
Response time:		        0.44 secs<br />
Transaction rate:	        9.95 trans/sec<br />
Throughput:		        0.03 MB/sec<br />
Concurrency:		        4.35<br />
Successful transactions:         507<br />
Failed transactions:	           0<br />
Longest transaction:	       15.49<br />
Shortest transaction:	        0.00<br />
</code></p>
<p>And with 2 more (total of 9 units in demo-wiki serving the app):</p>
<p><code><br />
Transactions:		         354 hits<br />
Availability:		      100.00 %<br />
Elapsed time:		       34.41 secs<br />
Data transferred:	        1.23 MB<br />
Response time:		        0.41 secs<br />
Transaction rate:	       10.29 trans/sec<br />
Throughput:		        0.04 MB/sec<br />
Concurrency:		        4.22<br />
Successful transactions:         337<br />
Failed transactions:	           0<br />
Longest transaction:	       11.45<br />
Shortest transaction:	        0.00<br />
</code></p>
<p>Anyway, this isn&#8217;t a MediaWiki benchmark. This is to show you how easy it is to scale up and down in response to load with Ensemble. We all know that scaling out works, and these graphs show it nicely:</p>
<p><img src="http://fewbar.com.s3.amazonaws.com/wp-content/uploads/2011/06/resptime.png" alt="Response Time" /><br />
<img src="http://fewbar.com.s3.amazonaws.com/wp-content/uploads/2011/06/tps.png" alt="Transactions per Second" /></p>
<p>Notice how the transactions/second went up all the time, but the response time went up drastically with the jump in concurrency. This is where you need to have the ability to scale quickly, and where, if you can live with the other limitations of EC2 or any other IaaS provider, the cloud should actually win you business, since better response time means more happy users.</p>
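<p>For anyone who prefers numbers to graphs, the same observation can be checked against the m1.small runs quoted above. This Python sketch is only a tabulation of those siege reports; the <code>Run</code> tuple and helper names are made up here, and I am assuming the first m1.small run used concurrency 5 and the 9 unit run used concurrency 10, since the post does not say so explicitly:</p>
<pre>
```python
from collections import namedtuple

Run = namedtuple("Run", "units concurrency tps response_secs")

# m1.small runs in the order they were made (figures from the siege output).
runs = [
    Run(2, 5, 3.55, 0.53),    # concurrency assumed to be 5
    Run(5, 5, 5.03, 0.47),
    Run(5, 10, 7.75, 0.66),
    Run(7, 10, 9.95, 0.44),
    Run(9, 10, 10.29, 0.41),  # concurrency assumed to be 10
]


def tps_always_rose(runs):
    """True if transactions/second increased from each run to the next."""
    rates = [run.tps for run in runs]
    return rates == sorted(rates) and len(set(rates)) == len(rates)


def slowest(runs):
    """The run with the worst average response time."""
    return max(runs, key=lambda run: run.response_secs)


print(tps_always_rose(runs), slowest(runs))
```
</pre>
<p>The slowest run is the one right after concurrency doubled on 5 units, and response time recovers once more units come online.</p>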
<p>Now that my siege is over, I can safely remove the unnecessary units one by one with &#8216;ensemble remove-unit demo-wiki/9&#8217;, etc. There&#8217;s still a lot of room for sugar to be added. We could say &#8220;ensemble resize-service demo-wiki 5&#8221; and it might just pick 5 to keep and remove the rest, or add 3 to fulfill the request. There are also a ton of other ideas just bubbling up that are really exciting.</p>
<p>Come say hi and hack on ensemble with us on Freenode in #ubuntu-ensemble, and on <a href="https://lists.ubuntu.com/mailman/listinfo/ensemble">the mailing list</a>.</p>
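<p>To make the resize idea a bit more concrete, here is a rough Python sketch of the planning step. Keep in mind that <code>resize-service</code> does not exist; this only works out which of the existing <code>add-unit</code> and <code>remove-unit</code> commands such a feature might issue. The function name <code>plan_resize</code> and the policy of dropping the highest numbered units first are my own assumptions:</p>
<pre>
```python
def plan_resize(service, current_units, target):
    """Return the ensemble commands needed to reach `target` units.

    Hypothetical policy: keep the lowest numbered units and remove the
    highest numbered ones first.
    """
    ordered = sorted(current_units,
                     key=lambda name: int(name.rsplit("/", 1)[1]))
    surplus = len(ordered) - target
    if surplus > 0:
        doomed = reversed(ordered[target:])
        return ["ensemble remove-unit " + name for name in doomed]
    # Zero or negative surplus: add however many units are missing.
    return ["ensemble add-unit " + service] * (-surplus)


nine_units = ["demo-wiki/%d" % n for n in range(9)]
for command in plan_resize("demo-wiki", nine_units, 5):
    print(command)
```
</pre>
<p>Going from 9 units down to 5 yields four remove-unit commands, and going from 2 units up to 5 yields three add-unit commands.</p>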
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2011/06/so-what-is-ensemble-anyway/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Gearman K.O.&#8217;s mysql to solr replication</title>
		<link>http://fewbar.com/2010/03/gearman-replicate-mysql-to-solr/</link>
		<comments>http://fewbar.com/2010/03/gearman-replicate-mysql-to-solr/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 05:47:36 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[gearman]]></category>
		<category><![CDATA[opensource]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=154</guid>
		<description><![CDATA[Ding ding ding.. in this corner, wearing black shorts and a giant schema, we have over 11 million records in MySQL with a complex set of rules governing which must be searchable and which must not be. And in that corner, we have the contender, a kid from the back streets, outweighed and out reached [...]]]></description>
			<content:encoded><![CDATA[<p>Ding ding ding.. in this corner, wearing black shorts and a giant schema, we have over 11 million records in MySQL with a complex set of rules governing which must be searchable and which must not be. And in that corner, we have the contender, a kid from the back streets, outweighed and outreached by all his opponents, but still victorious in the queue shootout, with just open source, and 12 patch releases.. written in C, it&#8217;s <b><a href="http://gearman.org">gearman</a></b>!</p>
<p><a href="http://fewbar.com.s3.amazonaws.com/wp-content/uploads/2010/03/ko-mike-tyson.png"><img src="http://fewbar.com.s3.amazonaws.com/wp-content/uploads/2010/03/ko-mike-tyson.png" alt="" title="ko-mike-tyson" width="500" height="437" class="alignnone size-full wp-image-155" /></a><br />
<span id="more-154"></span></p>
<p>I&#8217;m pretty excited today, as I&#8217;m preparing to go live with the first real, high load application of Gearman that I&#8217;ve written. What is it you say? Well it is a simple trigger based replicator from mysql to <a href="http://lucene.apache.org/solr/">SOLR</a>.</p>
<p>I should say (because I know some of my colleagues read this blog) that I don&#8217;t actually believe in this design. Replication using triggers seems fraught with danger. It totally makes sense if you have a giant application and can&#8217;t track down everywhere that a table is changed. However, if your app is simple and properly abstracted, hopefully you know the 1 or 2 places that write to the table.</p>
<p>I should also say that I really can&#8217;t reveal all of the details. The general idea is pretty simple. Basically we have a trigger that dumps a primary key into gearman via the <a href="https://launchpad.net/gearman-mysql-udf">gearman MySQL UDFs</a>. The idea is just to tell a gearman worker &#8220;look at this record in that table&#8221;.</p>
<p>Once the worker picks it up, it applies some logic to the record.. &#8220;should this be searchable or not&#8221;. If the answer is yes it should be searchable, the worker pushes the record into SOLR. If not, the worker will make sure it is not in solr.</p>
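<p>Since I can&#8217;t show the real code, here is a stand-in sketch of that decision in Python. Plain dicts play the role of the MySQL table and the Solr index, and the names <code>handle_job</code> and <code>is_searchable</code> are hypothetical; the actual rules are much more involved:</p>
<pre>
```python
def handle_job(primary_key, table, index, is_searchable):
    """Process one 'look at this record' job from the queue."""
    record = table.get(primary_key)
    if record is not None and is_searchable(record):
        index[primary_key] = record      # publish or refresh in the index
    else:
        index.pop(primary_key, None)     # make sure it is not in the index


# Toy data: dicts standing in for the MySQL table and the Solr index.
table = {1: {"title": "public", "hidden": False},
         2: {"title": "private", "hidden": True}}
index = {2: {"title": "private", "hidden": False}}   # stale entry

for key in (1, 2, 3):                    # 3 was deleted from the table
    handle_job(key, table, index, lambda record: not record["hidden"])
print(sorted(index))
```
</pre>
<p>The job carries only the primary key, so the worker always decides based on the current row, and running the same job twice is harmless.</p>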
<p>This at least is pretty simple. The end result is a system where we can rebuild the search index in parallel using multiple CPU&#8217;s (thank you to solr/lucene for being able to update indexes concurrently and efficiently btw). This is done by pushing all of the records in the table into the queue at once.</p>
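<p>The rebuild follows the same pattern: every key goes into the queue up front and a pool of workers drains it. Below is a small Python sketch of that shape using the standard <code>queue</code> and <code>threading</code> modules in place of gearman. It is an illustration only; the real workers are separate processes, and Python threads would not give the same CPU parallelism. <code>rebuild_index</code> is a made-up name:</p>
<pre>
```python
import queue
import threading


def rebuild_index(table, is_searchable, worker_count=4):
    """Queue every primary key at once, then let workers drain the queue."""
    jobs = queue.Queue()
    for primary_key in table:
        jobs.put(primary_key)

    index = {}
    lock = threading.Lock()              # guards the shared toy index

    def worker():
        while True:
            try:
                primary_key = jobs.get_nowait()
            except queue.Empty:
                return                   # queue drained, worker is done
            record = table[primary_key]
            if is_searchable(record):
                with lock:
                    index[primary_key] = record

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return index


rows = {n: {"id": n, "hidden": n % 3 == 0} for n in range(30)}
rebuilt = rebuild_index(rows, lambda record: not record["hidden"])
print(len(rebuilt))
```
</pre>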
<p>Anyway, gearmand is performing like a champ, libgearman and the gearman pecl module are doing great. I&#8217;m just really happy to see gearman rolled out in production, as I really do think it has that nice mix of simplicity and performance. I love the commandline client which makes it easy to write scripts to inject things into queues, or query workers.  This allows me to access a worker like this:</p>
<p><code>$ gearman -h gearmanbox -f all_workers -s<br />
Known Workers: 11<br />
<br />
boxname_RealTimeUpdate_Queue_TriggerWorker_1 jobs=627366,restarts=0,memory_MB=4.27,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13311 jobs=304134,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:58 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13306 jobs=606126,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13314 jobs=576714,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13342 jobs=294846,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13347 jobs=376998,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13359 jobs=470508,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:58 -0700<br />
boxname_RealTimeUpdate_Queue_Subject_13364 jobs=403182,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:58 -0700<br />
boxname_RealTimeUpdate_Property_SolrPublish_ jobs=219630,restarts=0,memory_MB=6.19,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Queue_TriggerWorker_2 jobs=393642,restarts=0,memory_MB=4.27,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700<br />
boxname_RealTimeUpdate_Property_SolrBatchPub jobs=6,restarts=0,memory_MB=6.23,lastcheckin=Tue, 23 Mar 2010 22:37:28 -0700</code></p>
<p>Brilliant.. no need for html or HTTP.. just a nice simple commandline interface.</p>
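<p>Plain text output like this is also trivial to feed into other scripts. As a small example, here is a Python sketch that parses one of the worker lines above into a dict. The field layout is taken from the output shown, and the function name <code>parse_worker_line</code> is just for illustration:</p>
<pre>
```python
from email.utils import parsedate_to_datetime


def parse_worker_line(line):
    """Split 'name jobs=..,restarts=..,memory_MB=..,lastcheckin=..'."""
    name, _, rest = line.partition(" ")
    # The check-in date contains a comma, so cut it off before splitting.
    counters, _, checkin = rest.partition(",lastcheckin=")
    fields = dict(item.split("=", 1) for item in counters.split(","))
    return {
        "name": name,
        "jobs": int(fields["jobs"]),
        "restarts": int(fields["restarts"]),
        "memory_mb": float(fields["memory_MB"]),
        "last_checkin": parsedate_to_datetime(checkin),
    }


line = ("boxname_RealTimeUpdate_Queue_TriggerWorker_1 jobs=627366,restarts=0,"
        "memory_MB=4.27,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700")
worker = parse_worker_line(line)
print(worker["name"], worker["jobs"], worker["last_checkin"].isoformat())
```
</pre>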
<p>I think gearman still has a ways to go. I&#8217;d really like to see some more administration added to it. Deleting empty queues and quickly flushing all queues without restarting gearmand would be nice-to-haves. We&#8217;ll see what happens going forward, but for now, thanks so much to the gearman team (especially Eric Day who showed me gearman, and Brian Aker for pushing hard to release v0.12).</p>
<p>w00t!</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2010/03/gearman-replicate-mysql-to-solr/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>TokyoTyrant &#8211; MemcacheDB, but without the BDB?</title>
		<link>http://fewbar.com/2009/06/tokyotyrant-memcachedb-but-without-the-bdb/</link>
		<comments>http://fewbar.com/2009/06/tokyotyrant-memcachedb-but-without-the-bdb/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 06:40:26 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Memcache]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[drizzle]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[tokyocabinet]]></category>
		<category><![CDATA[tokyotyrant]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=85</guid>
		<description><![CDATA[Anyway, the next thing I mentioned was that we had also tried MemcacheDB with some success. Brian wasn't exactly impressed with MemcacheDB, and immediately suggested that we should be using <a href="http://tokyocabinet.sourceforge.net/tyrantdoc/">Tokyo Tyrant</a> instead. I had heard of Tokyo Cabinet, the new hotness in local key/value storage and retrieval, but what is this Tyrant you speak of?]]></description>
			<content:encoded><![CDATA[<p>This past April I was riding in a late model, 2 door rental car with an interesting trio for sure. On my right sat <a href="http://capttofu.livejournal.com/">Patrick Galbraith</a>, maintainer of DBD::mysql and author of the Federated storage engine. Directly in front of me manning the steering wheel (for those of you keen on spatial description, you may have noted at this point that it&#8217;s most likely I was seated in the back, left seat of a car which is designed to be driven on the right side of the road. EOUF [end of useless fact]) was David Axmark, co-founder of MySQL. Immediately to his right sat <a href="http://krow.net/">Brian Aker</a>, of (most recently) Drizzle fame.<br />
<span id="more-85"></span><br />
This was one of those conversations that I felt grossly unprepared for. It was the 2009 MySQL User&#8217;s conference, and  Patrick and I had been hacking on <a href="https://launchpad.net/dbd-drizzle">DBD::drizzle</a> for most of the day. We had it 98% of the way there and were in need of food, so we were joining the Drizzle dev team for gourmet pizza.</p>
<p>As we navigated from the Santa Clara conference center to Mountain View&#8217;s quaint downtown, Brian, Patrick, and I were discussing memcached stuff. I mentioned <a href="http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/">my idea, and subsequent implementation of the Mogile+Memcached method for storing data more reliably</a> in memcached. I knew in my head why we had chosen to read from all of the replica servers, not just the first one that worked, but in the moment I forgot. (The reason, btw, is that if one of the servers had missed a write for some reason, you might get out-of-date data.) I guess I was a little overwhelmed by Brian&#8217;s mountain of experience with memcached.</p>
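<p>To illustrate the read-from-all-replicas point, here is a toy Python sketch. It assumes each stored value carries a version number so that copies can be compared; that detail, like the name <code>read_newest</code>, is an assumption made for this example and not a description of the actual implementation. Dicts stand in for the memcached servers:</p>
<pre>
```python
def read_newest(replicas, key):
    """Ask every replica and return the value with the highest version."""
    newest = None
    for replica in replicas:
        entry = replica.get(key)         # (version, value) or None
        if entry is None:
            continue
        if newest is None or entry[0] > newest[0]:
            newest = entry
    return None if newest is None else newest[1]


# Three replicas; the first one missed the second write.
replicas = [
    {"user:1": (1, "old address")},
    {"user:1": (2, "new address")},
    {"user:1": (2, "new address")},
]
print(read_newest(replicas, "user:1"))
```
</pre>
<p>Stopping at the first replica that answers would have returned the old address here, which is exactly the out-of-date read described above.</p>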
<p>Anyway, the next thing I mentioned was that we had also tried MemcacheDB with some success. Brian wasn&#8217;t exactly impressed with MemcacheDB, and immediately suggested that we should be using <a href="http://tokyocabinet.sourceforge.net/tyrantdoc/">Tokyo Tyrant</a> instead. I had heard of Tokyo Cabinet, the new hotness in local key/value storage and retrieval, but what is this Tyrant you speak of?</p>
<p>I&#8217;ve been playing with Tokyo Tyrant ever since, and advocating for its usage at Adicio. It&#8217;s pretty impressive. In addition to speaking the memcached protocol, it apparently speaks HTTP/WebDAV too. The ability to select hash, btree, and a host of other options is nice, though I&#8217;m sure some of these are available as obscure options to berkeleydb as well.</p>
<p>Anyway, I was curious what performance was like, so I did some tests on my little Xen instance, and came up with pretty graphs.</p>
<p><a href="http://fewbar.com/wp-content/uploads/2009/06/tokyotyrantvsmemcachedb1.gif"><img src="http://fewbar.com/wp-content/uploads/2009/06/tokyotyrantvsmemcachedb1.gif" alt="tokyotyrantvsmemcachedb1" title="tokyotyrantvsmemcachedb1" width="465" height="472" class="alignnone size-full wp-image-92" /></a></p>
<p>I used the excellent <a href="http://code.google.com/p/brutis/">Brutis</a> tool to run these benchmarks using the most interesting platform for me at the moment.. which would be, php with the pecl Memcache  module.</p>
<p>These numbers were specifically focused on usage that is typical to MemcacheDB. A wide range of keys (in this case, 10000 is &#8220;wide&#8221; since the testing system is very small), not-small items (2k or so), and lower write:read ratio (1:50). I had the tests restart each daemon after each run, and these numbers are the results of the average of 3 runs each test.</p>
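<p>For a rough picture of what that workload looks like, here is a Python sketch of a generator with the same shape: 10000 possible keys, items of about 2k, and one write for every 50 reads on average. This is not how Brutis is implemented; the function name and parameters are invented for illustration:</p>
<pre>
```python
import random


def workload(operations, key_count=10000, item_bytes=2048,
             reads_per_write=50, seed=0):
    """Yield (op, key, payload) tuples with a 1:50 write:read mix."""
    rng = random.Random(seed)            # seeded so runs are repeatable
    for _ in range(operations):
        key = "key%d" % rng.randrange(key_count)
        if rng.randrange(reads_per_write + 1) == 0:
            yield ("set", key, b"x" * item_bytes)
        else:
            yield ("get", key, None)


ops = list(workload(5100))
writes = [op for op in ops if op[0] == "set"]
print(len(ops), len(writes))
```
</pre>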
<p>I also tried these from another xen instance on the same LAN, and things got a lot slower. Not really sure why as latency is in the sub-millisecond range.. but maybe Xen&#8217;s networking just isn&#8217;t very fast. Either way, the numbers for each combination didn&#8217;t change much.</p>
<p>What I find interesting is that memcachedb in nosync mode actually went faster than memcached. Of course, in nosync mode, memcachedb is just throwing data at the disk. It doesn&#8217;t have to maintain LRU or slabs or anything.</p>
<p>Tokyo Tyrant was very consistent, and used *very* little RAM in all instances. I do recall reading that it compresses data. Maybe that&#8217;s a default? Anyway, Tokyo Tyrant was also the most CPU hungry of the bunch, so I have to assume having more cores might have resulted in much better results.</p>
<p>I&#8217;d like to get together a set of 3 or 4 machines to test multiple client threads, and replication as well. Will post that as part 2 when I pull it together. For now, it looks like.</p>
<p>In case anybody wants to repeat these tests, I&#8217;ve included <a href="http://spamaps.org/files/tt-mdb-memcache-tests.tgz">the results, and the scripts used to generate them in this tarball</a>.</p>
<p>&#8211; Additional info, 6/4/2009<br />
Another graph that some might find interesting, is this one detailing CPU usage. During all the tests, brutis used about 60% of the CPU available on the machine, so 40% is really 100%:</p>
<p><a href="http://fewbar.com/wp-content/uploads/2009/06/tokyotyranttests_cpu.gif"><img src="http://fewbar.com/wp-content/uploads/2009/06/tokyotyranttests_cpu.gif" alt="tokyotyranttests_cpu" title="tokyotyranttests_cpu" width="428" height="385" class="alignnone size-full wp-image-98" /></a></p>
<p>This tells me that the CPU was the limiting factor for Tokyo Tyrant, and with a multi-core machine, we should see huge speed improvements. Stay tuned for those tests!</p>
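<p>The &#8220;40% is really 100%&#8221; adjustment is simple arithmetic: divide the raw CPU figure by whatever share the benchmark client left over. A tiny Python sketch, with a made-up function name:</p>
<pre>
```python
def share_of_available(raw_percent, client_percent=60.0):
    """Rescale a raw CPU reading to the share the client left available."""
    available = 100.0 - client_percent   # brutis used about 60%
    return 100.0 * raw_percent / available


print(share_of_available(40.0))   # a server at 40% raw CPU is saturated
print(share_of_available(20.0))
```
</pre>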
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2009/06/tokyotyrant-memcachedb-but-without-the-bdb/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Parallel mysql replication?</title>
		<link>http://fewbar.com/2009/06/parallel-mysql-replication/</link>
		<comments>http://fewbar.com/2009/06/parallel-mysql-replication/#comments</comments>
		<pubDate>Tue, 02 Jun 2009 19:08:48 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[drizzle]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[replication]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=80</guid>
		<description><![CDATA[It&#8217;s always been a dream of mine. I&#8217;ve posted about parallel replication on Drizzle&#8217;s mailing list before. I think when faced with the problem of a big, highly concurrent master, and scaling out reads simply with lower cost slaves, this is going to be the only way to go. So today I was really glad [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s always been a dream of mine. I&#8217;ve <a href="https://lists.launchpad.net/drizzle-discuss/msg03988.html">posted about parallel replication</a> on Drizzle&#8217;s mailing list before. I think when faced with the problem of a big, highly concurrent master, and scaling out reads simply with lower cost slaves, this is going to be the only way to go.</p>
<p>So today I was really glad to see that somebody is trying out the idea. Seppo Jaakola from <a href="http://www.codership.com/">&#8220;Codership&#8221;</a>, who I&#8217;ve never heard of before today, <a href="https://lists.launchpad.net/drizzle-discuss/msg04214.html">posted a link</a> to an article on his blog about his <a href="http://www.codership.com/content/parallel-applying">experimentation with parallel replication slaves</a>. The findings are pretty interesting.<br />
<span id="more-80"></span><br />
I hope that he&#8217;ll be able to repeat his tests with a real-world setup. The software they&#8217;ve written seems to have the right idea. The biggest issue I have with the tests is that they were run on tiny hardware. Hyperthreading? Single disks? That&#8217;s not really the point of having parallel replication slaves.</p>
<p>The idea is that you have maybe a gigantic real-time write server for OLTP. This beast may have lots of medium-power CPU cores, an obscene amount of RAM, and a lot of battery-backed write cache.</p>
<p>Now you know that there are tons of reads that shouldn&#8217;t ever be done against this server. You drop a few replication slaves in, and you realize that you need a box with as much disk storage as your central server, and probably just as much write cache. Pretty soon scaling out those reads is just not very cost-effective.</p>
<p>However, if you could have lots of CPU cores, and lots of cheap disks, you could dispatch these writes to be done in parallel, and you wouldn&#8217;t need expensive disk systems or lots of RAM for each slave.</p>
<p>So, the idea is not to make slaves faster in a 1:1 size comparison. It&#8217;s to make it easier for a cheap slave to keep up with a very busy, very expensive master.</p>
<p>I do see that another huge limiting factor is making sure things synchronize in commit order. I think that&#8217;s an area where a lot of time needs to be spent on optimization. The order should already be known, so the committer thread just waits for the next transaction in line; if the next 100 are already done, it can rip through them quickly instead of signaling each one that it may go. Something like this seems right:</p>
<p><code><br />
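/* first_commit_id(), wait_for_commit() and commit() are hypothetical helpers */<br />
id = first_commit_id();<br />
/* wait_for_commit() returns at once if id has already been applied */<br />
while (wait_for_commit(id)) {<br />
  commit(id);<br />
  id++;<br />
}<br />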
</code></p>
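<p>To make that concrete, here is a minimal Python sketch of such a committer. It is only an illustration under assumptions of mine, not Codership&#8217;s implementation: the class and method names are made up, and a list append stands in for the real commit. Worker threads report finished transactions in any order, and the committer sleeps only when the next id in line is not ready yet.</p>
<pre><code>import threading


class InOrderCommitter:
    """Commit transactions strictly in id order, even if they finish out of order."""

    def __init__(self, first_id):
        self.next_id = first_id   # next id that is allowed to commit
        self.applied = set()      # ids applied by workers but not yet committed
        self.committed = []       # stand-in for the real commit log
        self.cond = threading.Condition()

    def mark_applied(self, txn_id):
        # Called by a worker thread once it has applied txn_id.
        with self.cond:
            self.applied.add(txn_id)
            self.cond.notify()

    def run(self, last_id):
        # Committer loop: wait only when the next id in line is not ready,
        # otherwise rip through everything that is already done.
        with self.cond:
            while self.next_id != last_id + 1:
                while self.next_id not in self.applied:
                    self.cond.wait()
                self.applied.remove(self.next_id)
                self.committed.append(self.next_id)
                self.next_id += 1
</code></pre>
<p>In a real slave the committer would run in its own thread while the appliers call mark_applied(). Called after the fact, as in run(5) once ids 1 through 5 have been applied in any order, it commits all five in id order without ever waiting.</p>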
<p>I applaud the efforts of Codership, and I hope they&#8217;ll continue the project and maybe ship something that will rock all our worlds.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2009/06/parallel-mysql-replication/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MemcacheDB fault tolerance procedures</title>
		<link>http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/</link>
		<comments>http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 18:07:24 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[fault tolerance]]></category>
		<category><![CDATA[heartbeat]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[Scalability]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=46</guid>
		<description><![CDATA[It seemed so simple, just set up two memcachedb instances and point them at each other. Instant fault tolerance, right? If only it were so simple! It&#8217;s not entirely clear from the documentation how to set up memcachedb for fault tolerance. Here are the procedures I&#8217;ve found useful. Set up replication right. With all due respect to Steve Chu, [...]]]></description>
			<content:encoded><![CDATA[<p>It seemed so simple, just set up two memcachedb instances and point them at each other. Instant fault tolerance, right? If only it were so simple!</p>
<p>It&#8217;s not entirely clear from the documentation how to set up memcachedb for fault tolerance. Here are the procedures I&#8217;ve found useful.<br />
<span id="more-46"></span></p>
<ul>
<li><strong>Set up replication right</strong>. With all due respect to Steve Chu, the docs aren&#8217;t really clear on how to set up replication. It&#8217;s much simpler than it looks. Just run MemcacheDB as you would if it were standalone, but then add a combination of these 3 options:
<ul>
<li>You must have a -R line if you want to participate in replication. This is your hostname and port that listens for connections from other machines for replication. It is the same value that should be listed in every other machine&#8217;s -O.</li>
<li>A -O for *every* other machine that may want to replicate to/from this machine. I am sure there are situations where you won&#8217;t need these, but it makes re-syncing and elections more predictable. You won&#8217;t be able to re-sync &#8220;live&#8221; after a failure without -O options.</li>
<li>-M/-S are not required. If you start n machines without -M or -S, but with appropriate -R and -O lines, they will arbitrarily elect a master. If you run them with -M and -S, then the -M box will just be pushy and always elect itself the master, and the -S boxes will, likewise, always try to defer to slave status.</li>
<li>Let&#8217;s say we wanted to listen for memcache protocol on port 45000 on host &#8216;node1&#8217; and replicate to &#8216;node2&#8217;:</li>
<li>Standalone: <code>memcachedb -p 45000 -H /home/memdb/data -u memdb -N</code></li>
<li>Replication w/ elected master: <code>memcachedb -p 45000 -H /home/memdb/data -u memdb -N -R node1:46000 -O node2:46000</code></li>
<li>Replication Master: <code>memcachedb -p 45000 -H /home/memdb/data -u memdb -N -R node1:46000 -O node2:46000 -M</code></li>
</ul>
</li>
<li><strong>Only the current master can accept writes</strong>. You can see which machine is the master with the &#8216;stats rep&#8217; command. In v1.2.1 it&#8217;s shown as an environment id. Below, st_env_id and st_master are the same, so this is the master:<br />
<code><br />
stats rep<br />
STAT st_bulk_fills 0<br />
STAT st_bulk_overflows 0<br />
STAT st_bulk_records 11<br />
STAT st_bulk_transfers 3<br />
STAT st_client_rerequests 0<br />
STAT st_client_svc_miss 0<br />
STAT st_client_svc_req 0<br />
STAT st_dupmasters 0<br />
STAT st_egen 3<br />
STAT st_election_cur_winner 2147483647<br />
STAT st_election_gen 0<br />
STAT st_election_lsn 1/28<br />
STAT st_election_nsites 0<br />
STAT st_election_nvotes 1<br />
STAT st_election_priority 100<br />
STAT st_election_sec 5<br />
STAT st_election_status 0<br />
STAT st_election_tiebreaker 3676766282<br />
STAT st_election_usec 69747<br />
STAT st_election_votes 0<br />
STAT st_elections 1<br />
STAT st_elections_won 1<br />
STAT st_env_id 2147483647<br />
STAT st_env_priority 100<br />
STAT st_gen 2<br />
STAT st_log_duplicated 0<br />
STAT st_log_queued 0<br />
STAT st_log_queued_max 0<br />
STAT st_log_queued_total 0<br />
STAT st_log_records 0<br />
STAT st_log_requested 0<br />
STAT st_master 2147483647<br />
STAT st_master_changes 0<br />
STAT st_max_lease_sec 0<br />
STAT st_max_lease_usec 0<br />
STAT st_max_perm_lsn 0/0<br />
STAT st_msgs_badgen 0<br />
STAT st_msgs_processed 5<br />
STAT st_msgs_recover 0<br />
STAT st_msgs_send_failures 2<br />
STAT st_msgs_sent 10<br />
STAT st_newsites 0<br />
STAT st_next_lsn 1/8916<br />
STAT st_next_pg 0<br />
STAT st_nsites 2<br />
STAT st_nthrottles 0<br />
STAT st_outdated 0<br />
STAT st_pg_duplicated 0<br />
STAT st_pg_records 0<br />
STAT st_pg_requested 0<br />
STAT st_startsync_delayed 0<br />
STAT st_startup_complete 0<br />
STAT st_status 2<br />
STAT st_txns_applied 0<br />
STAT st_waiting_lsn 0/0<br />
STAT st_waiting_pg 0<br />
END<br />
</code><br />
However, it&#8217;s much simpler, I think, to just try to store a value on an instance. If you get &#8220;STORED&#8221; back, then this is the master. If you get NOT_STORED back, this is a slave. If it blocks (timeouts are hard in simple scripts, I know; see perldoc -f alarm), you are in a &#8220;DOWN&#8221; state. The danger here is split brain, where both nodes think they&#8217;re the master.. but.. if they&#8217;re not talking, you have bigger problems!</li>
<li><strong>Out-of-sync slaves can&#8217;t READ either!</strong> This one bit us just the other day. Something occurred where our slave wasn&#8217;t able to retrieve the latest log entries from the master. Because of this, it was reporting errors in replication. During this time, *all* commands blocked. We were relying on basic round-robin DNS for failover, thinking that memcachedb was simple enough that it was either &#8220;up&#8221; or &#8220;down&#8221;. Unfortunately, it was stalled on one box, so everything that hit that box blocked and timed out until we firewalled the port so connections wouldn&#8217;t succeed. We eventually had to stop the instance, copy a db_hotbackup from the master, then start it again. It still had to catch up from the point at which the db_hotbackup copy&#8217;s logs were checkpointed, which was (because we&#8217;re on v1.0.3) many hours before. While it was catching up, all commands (even stats commands.. which is disappointing..) blocked.
</li>
<li><strong>Use a load balancer, not round robin</strong>. With that said, a load balancer is a far better solution than round robin. In this case, because the box was &#8220;up&#8221;, but failing to respond, we were at the mercy of the pecl memcache module&#8217;s definition of what was &#8220;up&#8221; or &#8220;down&#8221; for reads. A load balancer separates this logic out into monitors so the code can just connect to a virtual IP, or use some list of servers it is given.</li>
<li><strong>Even better.. just use a floating IP</strong>. MemcacheDB seems to scale to ridiculous levels with reads. Like, 400:1 read:write performance. Do you really need lots of slaves? Just having an IP that follows the master will give you fault tolerance. It&#8217;s easy to determine if a box is the master. You can even do a &#8216;rep_set_priority 500&#8217; to make sure a box stays the master as long as it has the IP. If you&#8217;re running on Linux, good old <a href="http://www.linux-ha.org/">Heartbeat</a> is perfect for this. If you need to scale past the write capabilities of one box, then partitioning by using a stable hash algorithm on the keys is a far better solution than master/slave replication, and is already built in to pretty much every memcache client.</li>
<li><strong>Be careful with db_archive/db_checkpoint</strong>. This is mostly regarding v1.0.3, as I don&#8217;t know the impact of these commands on v1.1 or 1.2. However, it would seem that even with a replication policy of &#8220;ACK_ONE&#8221;, it&#8217;s still possible to purge logs that the slave needs. This may or may not be true (something else could have gone wrong), but running db_checkpoint/db_archive too aggressively seems to have broken our replication. There&#8217;s no reason to purge logs too often, so be wary when doing so.</li>
</ul>
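<p>The store-a-value check described above can be sketched in a few lines of Python. This is a rough sketch rather than a tested monitor: the key name ha_probe, the function names, and the two second timeout are my own assumptions, and the mapping of replies to roles is simply the one described in the list above.</p>
<pre><code>import socket


def classify_reply(reply):
    """Map the reply to a memcache-protocol set command onto a node role."""
    line = reply.strip()
    if line == b"STORED":
        return "master"
    if line == b"NOT_STORED":
        return "slave"
    return "down"  # empty or unexpected reply


def probe_role(host, port, timeout=2.0):
    """Try to store a throwaway key; a timeout or socket error counts as down."""
    # Text protocol: set KEY FLAGS EXPTIME BYTES, then the data block.
    # The key name ha_probe is made up for this sketch.
    request = b"set ha_probe 0 0 1\r\n1\r\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            return classify_reply(sock.recv(64))
    except OSError:  # covers refused connections and timeouts
        return "down"
</code></pre>
<p>Something like probe_role("node1", 45000) would then return "master", "slave" or "down", which is about all a floating-IP or load balancer monitor needs to know. Note that the probe overwrites the same key every time it runs on the master.</p>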
<p>Hopefully this will help other users who are starting to set up MemcacheDB and need fault tolerance.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2009/03/memcachedb-fault-tolerance-procedures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Memcached and Mogile Form MemcacheMegaZord!</title>
		<link>http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/</link>
		<comments>http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/#comments</comments>
		<pubDate>Sun, 14 Dec 2008 17:21:50 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[memcachedb]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[sessions]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=27</guid>
		<description><![CDATA[So I was starting to play with Memcached for session storage, and I found a fairly big problem with just using memcached in its normal caching mode as a session store. It really just boils down to caching and storing of deterministic data being very different things that only look similar on the surface. So normally, [...]]]></description>
			<content:encoded><![CDATA[<p>So I was starting to play with Memcached for session storage, and I found a fairly big problem with just using memcached in its normal caching mode as a session store. It really just boils down to caching and storing of deterministic data being very different things that only look similar on the surface.<br />
<span id="more-27"></span><br />
So normally, memcached is used in a very clever way by adding a list of servers, and then using a hashing algorithm to pick a server to actually contact based on the key of a get/set request. This allows a ton of scaling out, with minimal moving parts. There&#8217;s no periodic monitor or broadcast protocol to add and remove cluster members to and from pools, so you can just run memcached on a bunch of servers, and use a consistent list across all of your machines to achieve a huge degree of scale out. When a server dies, the code just sees that, and moves on to the next one in the hash algorithm, and all is well.</p>
<p>For caching, this &#8220;failover&#8221; methodology works fine. If I go to set a value in memcached, and the client fails over to the second server, that&#8217;s OK. The next get to the primary will miss, the value will be set properly there, and the old entry on the secondary will *eventually* get pushed out of the cache.</p>
<p>However, for storing data reliably, this becomes a problem. Let&#8217;s say there is a scenario where a network cable is bad on one of the memcached servers. 1 in 100 requests fails. With caching, failover will go a little nuts, but it&#8217;s entirely possible nobody will even notice, as results will be cached, data won&#8217;t get stale.. no big deal.</p>
<p>With storage, though, this could happen:</p>
<p>- session is created on memcache1</p>
<p>- session tries to read from memcache1, and fails.. so new session is created on memcache2</p>
<p>- session is then read from memcache1</p>
<p>- session is updated on memcache1 with new information</p>
<p>- session fails to read from memcache1, and old session data is read from memcache2, then the set succeeds on memcache1, and the old data is lost.</p>
<p>The point isn&#8217;t really this scenario&#8217;s details, but that this hashing algorithm is vulnerable to losing data that was written to it, and is even designed to do so. That is the caching paradigm.</p>
<p>As I discussed this with some colleagues, my mind immediately jumped to <a href="http://www.memcachedb.org">MemcacheDB</a>. Maybe that would work for session storage. It has replication, so we could use the traditional active/passive paradigm for it. However, this limits our scale to whatever a single instance of MemcacheDB can handle. Honestly, that&#8217;s probably fine for most sites, as MemcacheDB can probably handle tens of thousands of small writes per second.</p>
<p>However, there are multiple problems. The biggest problem with MemcacheDB is there&#8217;s no easy way (yet, they&#8217;re working on it) to pull keys out of it to do garbage collection. Likewise, session data really doesn&#8217;t need to live for a long time. We just need to be reasonably certain that the data we&#8217;re getting is reasonably new.</p>
<p>If we store the data in *all* of the servers, and store a highly accurate timestamp of when the data was given to us alongside it, we can then just read it from all of the servers and pick the newest one. The timestamp has to be finer-grained than a request (if a request completes in milliseconds, the timestamp needs to be down to microseconds), and we use the same timestamp for each server. Ew, that means we are still limited to the scale of one instance of memcached.</p>
<p>Then I had a flashback to the way <a href="http://danga.com/mogilefs/">MogileFS</a> works. It stores data on a number of replica servers. Of course, it also keeps track of where it stored them. But I figured, for sessions, that&#8217;s a lot of overhead. There&#8217;s an easier way. We can use the <a href="http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/">consistent hashing algorithm</a> that the PHP Memcache module uses to pick servers, and just read and write the data from nReplicas servers. If a server fails, we&#8217;ll move on to the next one, and there&#8217;s a reasonable degree of certainty that it will remain the same. If we write stale data to a server and then fail back to it later, we&#8217;re protected by the timestamp rules. The higher nReplicas is, the more likely it is that a server failure won&#8217;t cause issues. I even found <a href="http://paul.annesley.cc/articles/2008/04/30/flexihash-consistent-hashing-php">a PHP implementation of consistent hashing called FlexiHash</a>.</p>
<p>There&#8217;s one last issue that bugs me about using memcached for sessioning, and that the timestamp helps us solve. We recently found that there was a problem where a request would take, say, 45 seconds to complete. At 20 seconds, the user would hit the back button out of frustration. This would put other stuff in the session, then the 45 second request would complete, and write the version of the session it thinks is right to the session store, losing the user&#8217;s new activity.</p>
<p>There are two ways to solve this. One is to introduce locking. This actually isn&#8217;t hard to do with Memcached; it is <a href="http://www.socialtext.net/memcached/index.cgi?faq#emulating_locking_with_the_add_command">described in the memcached faq</a>. However, this introduces something to block or fail on in the read. I think it&#8217;s simpler than that. You simply read the record before you write it, and if it has changed since you read it the first time, you don&#8217;t write it. You just throw the session write away. Obviously the user has moved on, so there&#8217;s no reason to make your update. If you used locking, the user would still be waiting on the old thread to finish.</p>
<p>Of course, this all hinges on you caring that your session data is accurate, and that you care that users don&#8217;t lose their sessions when one server goes down. If neither of those apply to you, then you can just use sessions like cache.</p>
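<p>Here is a small Python sketch of the replicated, timestamped session idea, under some loud assumptions: plain dicts stand in for memcached servers, every function name is made up, and server selection uses simple rendezvous hashing (rank servers by a hash of server name plus key) as a stand-in for the consistent hashing the PHP module uses. The read-before-write check is not atomic, so treat it as an illustration of the rule, not a finished session handler.</p>
<pre><code>import hashlib
import time


def pick_replicas(key, servers, n_replicas):
    """Rank servers by a hash of (server, key) and keep the first n_replicas."""
    def score(name):
        return hashlib.md5((name + ":" + key).encode()).hexdigest()
    return sorted(servers, key=score)[:n_replicas]


def write_session(stores, key, data, n_replicas=2, stamp=None):
    """Write (timestamp, data) to every replica, using one timestamp for all."""
    stamp = time.time() if stamp is None else stamp
    for name in pick_replicas(key, list(stores), n_replicas):
        stores[name][key] = (stamp, data)
    return stamp


def read_session(stores, key, n_replicas=2):
    """Read from every replica and return the newest (timestamp, data), or None."""
    found = [stores[name][key]
             for name in pick_replicas(key, list(stores), n_replicas)
             if key in stores[name]]
    return max(found, key=lambda pair: pair[0]) if found else None


def write_if_unchanged(stores, key, data, seen_stamp, n_replicas=2):
    """Throw the write away if the session changed after we first read it."""
    current = read_session(stores, key, n_replicas)
    if current is not None and current[0] != seen_stamp:
        return False  # the user has moved on
    write_session(stores, key, data, n_replicas)
    return True
</code></pre>
<p>A request would call read_session() when it starts, remember the timestamp it saw, and hand that back to write_if_unchanged() when it finishes. The slow 45 second request from the example above would get False back and its stale session would simply be dropped.</p>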
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2008/12/memcached-and-mogile-form-memcachemegazord/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Deciding whether to send reads to slave or master</title>
		<link>http://fewbar.com/2008/10/maximizing-usage-of-mysql-replication-slaves/</link>
		<comments>http://fewbar.com/2008/10/maximizing-usage-of-mysql-replication-slaves/#comments</comments>
		<pubDate>Sat, 04 Oct 2008 17:43:42 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[application]]></category>
		<category><![CDATA[replication]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=18</guid>
		<description><![CDATA[There are quite a few articles out there that talk about how to give your application some context and send reads to one server, and writes to another. There are even some mentions of marking your connection &#8220;dirty&#8221; and then sending all reads to the write server. As a first try at scaling things, I [...]]]></description>
			<content:encoded><![CDATA[<p>There are quite a few articles out there that talk about how to give your application some context and send reads to one server, and writes to another. There are even some mentions of marking your connection &#8220;dirty&#8221; and then sending all reads to the write server.</p>
<p>As a first try at scaling things, I recently made a change to our web application&#8217;s data access layer where reads went to a group of readonly slaves. However, if a write was made to a database, a value was put into the user&#8217;s session, saying that the database was dirty, and causing all subsequent reads to go to the master server.<br />
<span id="more-18"></span><br />
This was good as users would use the readonly slaves as long as they hadn&#8217;t changed anything in the database. The real problem though, was that as soon as the user logged in, their account was updated to say that they had logged in, marking that database dirty.</p>
<p>Rather than try to cleverly change this one problem, we changed the &#8220;dirty&#8221; value from a boolean to a timestamp. Whenever the user writes to the database, it records the current time in their session. Then a global timeout is applied to that. This gives the replication slaves time to catch up and get the record that was just changed, so the user will have a consistent view of their data.</p>
<p>This is great, but I think a further step is to have something publish the actual maximum lag of the slaves into a memcache key, and simply double that value as the timeout. This would allow maximum usage of the readonly slaves and keep the master server busy doing mostly writes.</p>
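<p>A rough Python sketch of that routing decision follows. The function and parameter names are my own, and the lag value is assumed to come from wherever the monitor publishes it (for example the memcache key mentioned above); this is not our actual data access layer.</p>
<pre><code>import time


def pick_server(last_write_at, max_slave_lag, now=None):
    """Return "master" while a recent write may not have replicated, else "slave"."""
    if last_write_at is None:
        return "slave"              # this session has never written
    now = time.time() if now is None else now
    timeout = 2 * max_slave_lag     # double the published maximum lag
    age = now - last_write_at       # seconds since the session last wrote
    return "master" if timeout > age else "slave"
</code></pre>
<p>With a published lag of 3 seconds, a session that wrote 1 second ago still reads from the master, while one that wrote 10 seconds ago goes back to the read-only slaves.</p>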
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2008/10/maximizing-usage-of-mysql-replication-slaves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Can more queries equal a healthier MySQL server?</title>
		<link>http://fewbar.com/2008/08/innodb-concurrency-problems-on-multi-core-boxes-possibly-a-thing-of-the-past/</link>
		<comments>http://fewbar.com/2008/08/innodb-concurrency-problems-on-multi-core-boxes-possibly-a-thing-of-the-past/#comments</comments>
		<pubDate>Sat, 30 Aug 2008 06:21:34 +0000</pubDate>
		<dc:creator>clint</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[innodb]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://fewbar.com/?p=14</guid>
		<description><![CDATA[This week was an ugly one for my monster database servers. It should have been triumphant, but oddly enough, I think it shows how prone InnoDB on MySQL 5.0 is to mistuning on multi-core machines. This server is a multi-core, high concurrency server. The application has been designed a little bit naively in that it [...]]]></description>
			<content:encoded><![CDATA[<p>This week was an ugly one for my monster database servers. It should have been triumphant, but oddly enough, I think it shows how prone InnoDB on MySQL 5.0 is to mistuning on multi-core machines.</p>
<p>This server is a multi-core, high concurrency server. The application has been designed a little bit naively in that it just throws almost all queries at the main db server. Several bits have been designed to scale by not doing that, but unfortunately, huge amounts of functionality were built around those apps in ways that prevent them from scaling.</p>
<p>As a result, we&#8217;ve had to scale up the central database server and its redundant systems significantly. We started with the Proliant DL380 G4 with two Xeon 3.4GHz CPUs and 12GB of RAM, and plenty of disks in an external RAID. As more traffic was added, we moved up to the DL580 servers with 4 Xeon 3.4GHz CPUs and 64GB of RAM. This worked well, but still more traffic, and more data, was coming and the app wasn&#8217;t ready to change significantly. We finally landed on the latest DL580 server, with 1GB of total battery-backed write cache, 14 SAS disks, 128GB of RAM, and two quad-core Xeon CPUs.<br />
<span id="more-14"></span><br />
Some things got better. Writes were now incredibly fast. The server was churning out 1000 queries per second easily. Sometimes during peak times, query response time would suffer, but ultimately, the box was keeping up and performing well. <a href="http://fewbar.com/2008/07/mysql-query-cache-scales-like-a-286-with-turbo-off/">Especially after we turned off query caching</a>. After this week though, I wonder how much of the problem was query caching&#8230; more later.</p>
<p>Anyway, whenever the server needed maintenance, some high traffic applications would suffer needlessly because they depend on rarely changing data (memcached was out of the question for the complexity and &#8220;realtime&#8221; nature of this data). So we set up a selective replication fanout onto multiple boxes and pointed these apps at that cluster for these queries.</p>
<p>Well the next day, without all of these tiny queries pounding on it, the database server had horrible problems. 400 threads stacked up inside InnoDB &#8220;Waiting for InnoDB queue&#8221;. System resources were fine, but it was clear, InnoDB was having trouble. Queries that normally take 0.75 seconds were taking 300+ seconds, or just never completing. I knew there was real trouble, when killing the thread would result in it just changing state to &#8220;Killed&#8221;, but never dying. Based on what I&#8217;d read in High Performance MySQL, and <a href="http://www.mysqlperformanceblog.com/2006/06/05/innodb-thread-concurrency/">articles like this one</a>, I tried twiddling with innodb_thread_concurrency, innodb_concurrency_tickets, and innodb_thread_sleep_delay. None of them seemed to help, though innodb_thread_concurrency set to a value of about half the CPU cores seemed to delay the problems.</p>
<p>I noticed that we were running MySQL v5.0.51a still. We had planned an upgrade to 5.0.67, which was just recently released, but hadn&#8217;t gotten there yet. I went ahead and upgraded one of the boxes to it, and failed over to it. Instantly things were more healthy, and the health seemed to stay for hours, without any more InnoDB freakouts.</p>
<p>After some research, it would seem that between 5.0.51a and 5.0.67, a lot of really big fixes were made to InnoDB to help it scale up on multi-core machines. The box has been healthy for a couple of days, though there&#8217;s still a lot of work to do removing query load from the server.</p>
<p>But why would a _reduction_ in queries cause concurrency problems? I have a theory, but no real ideas on how to test it.</p>
<p>Before, we were doing 1000 queries per second. Things were healthy. We removed about 400 queries per second from that. These 400 queries were basically instantaneous.. often returning no results at all and reading from tables and indexes completely stored in the innodb_buffer_pool. But, with query cache turned off, they were still being processed fully by InnoDB. When we removed these tiny queries from the queue imposed by innodb_thread_concurrency, I think we removed the equivalent of spin waits from the queue. These tiny, easy queries were just hard enough to process to prevent a lot of bigger queries from hitting the queue at the same time. That&#8217;s why reducing innodb_thread_concurrency to 4 helped a bit.. with only 4 threads vying for mutexes and CPU resources constantly, InnoDB was able to (sort of) keep up.</p>
<p>My final bit of evidence for this is that we actually, I think, had this problem before with the <a href="http://fewbar.com/2008/07/mysql-query-cache-scales-like-a-286-with-turbo-off/">aforementioned article</a>. Turning off the query cache moved these tiny queries out of the query cache, and into the InnoDB queue, providing the needed pseudo-spin-waits to prevent it from locking in on itself.</p>
<p>I have to wonder if raising innodb_sync_spin_loops to something ridiculously high, like 50000, would have the same effect. Unfortunately, it&#8217;s very hard to test this without dedicating a lot of time to it.</p>
<p>So, in this case, it would seem that more work can, in fact, make the server healthier.</p>
]]></content:encoded>
			<wfw:commentRss>http://fewbar.com/2008/08/innodb-concurrency-problems-on-multi-core-boxes-possibly-a-thing-of-the-past/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>


