Gearman K.O.’s mysql to solr replication

Ding ding ding.. in this corner, wearing black shorts and a giant schema, we have over 11 million records in MySQL with a complex set of rules governing which must be searchable and which must not be. And in that corner, we have the contender, a kid from the back streets, outweighed and out reached by all his opponents, but still victorious in the queue shootout, with just open source, and 12 patch releases.. written in C, its gearman!


I’m pretty excited today, as I’m preparing to go live with the first real, high load application of Gearman that I’ve written. What is it you say? Well it is a simple trigger based replicator from mysql to SOLR.

I should say (because I know some of my colleagues read this blog) that I don’t actually believe in this design. Replication using triggers seems fraught with danger. It totally makes sense if you have a giant application and can’t track down everywhere that a table is changed. However, if your app is simple and properly abstracted, hopefully you know the 1 or 2 places that write to the table.

I should also say that I really can’t reveal all of the details. The general idea is pretty simple. Basically we have a trigger that dumps a primary key into gearman via the gearman MySQL UDFs. The idea is just to tell a gearman worker “look at this record in that table”.

Once the worker picks it up, it applies some logic to the record.. “should this be searchable or not”. If the answer is yes it should be searchable, the worker pushes the record into SOLR. If not, the worker will make sure it is not in solr.

This at least is pretty simple. The end result is a system where we can rebuild the search index in parallel using multiple CPU’s (thank you to solr/lucene for being able to update indexes concurrently and efficiently btw). This is done by pushing all of the records in the table into the queue at once.

Anyway, gearmand is performing like a champ, libgearman and the gearman pecl module are doing great. I’m just really happy to see gearman rolled out in production, as I really do think it has that nice mix of simplicity and performance. I love the commandline client which makes it easy to write scripts to inject things into queues, or query workers. This allows me to access a worker like this:

$ gearman -h gearmanbox -f all_workers -s
Known Workers: 11

boxname_RealTimeUpdate_Queue_TriggerWorker_1 jobs=627366,restarts=0,memory_MB=4.27,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Queue_Subject_13311 jobs=304134,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:58 -0700
boxname_RealTimeUpdate_Queue_Subject_13306 jobs=606126,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Queue_Subject_13314 jobs=576714,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Queue_Subject_13342 jobs=294846,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Queue_Subject_13347 jobs=376998,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Queue_Subject_13359 jobs=470508,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:58 -0700
boxname_RealTimeUpdate_Queue_Subject_13364 jobs=403182,restarts=0,memory_MB=7.03,lastcheckin=Tue, 23 Mar 2010 22:37:58 -0700
boxname_RealTimeUpdate_Property_SolrPublish_ jobs=219630,restarts=0,memory_MB=6.19,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Queue_TriggerWorker_2 jobs=393642,restarts=0,memory_MB=4.27,lastcheckin=Tue, 23 Mar 2010 22:37:59 -0700
boxname_RealTimeUpdate_Property_SolrBatchPub jobs=6,restarts=0,memory_MB=6.23,lastcheckin=Tue, 23 Mar 2010 22:37:28 -0700

Brilliant.. no need for html or HTTP.. just a nice simple commandline interface.

I think gearman still has a ways to go. I’d really like to see some more administration added to it. Deleting empty queues and quickly flushing all queues without restarting gearmand would be nice to haves. We’ll see what happens going forward, but for not, thanks so much to the gearman team (especially Eric Day who showed me gearman, and Brian Aker for pushing hard to release v0.12).

w00t!

One thought on “Gearman K.O.’s mysql to solr replication

  1. Thanks for sharing useful information about the simple trigger based replicator. It’s amazing that a simple trigger based replicator from mysql to SOLR can demonstrate such technology, along with rebuilding the search index in parallel using multiple CPU’s (due to solr/lucene for being able to update indexes concurrently and efficiently ) . I referred to http://www.lucidimagination.com/search/?q=indexing for clearer understanding of the concept which was benefial.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>