Aug 29 2008

Can more queries equal a healthier MySQL server?

This week was an ugly one for my monster database servers. It should have been triumphant, but oddly enough, I think it shows how prone to mistuning InnoDB on MySQL 5.0 is with multiple cores.

This server is a multi-core, high concurrency server. The application has been designed a little bit naively in that it just throws almost all queries at the main db server. Several bits have been designed to scale by not doing that, but unfortunately, huge amounts of functionality were built around those apps to prevent them from scaling.

As a result, we’ve had to scale up the central database server and its redundant systems significantly. We started with the Proliant DL380 G4 with two Xeon 3.4Ghz CPU’s and 12GB of RAM, and plenty of disks in an external RAID. As more traffic was added, we moved up to the DL580 servers with 4 Xeon 3.4Ghz and 64GB of RAM. This worked well, but still more traffic, and more data, was coming and the app wasn’t ready to change significantly. We finally landed on the latest DL580 server, with 1GB of total battery backed write cache, 14 SAS disks, 128GB of RAM, and two quad core Xeon CPU’s.

Some things got better. Writes were now incredibly fast. The server was churning out 1000 queries per second easily. Sometimes during peak times, query response time would suffer, but ultimately, the box was keeping up and performing well. Especially after we turned of query caching. After this week though, I wonder how much of the problem was query caching… more later.

Anyway, whenever the server would need to have maintenance, some high traffic applications would suffer needlessly for their need of rarely changing data (memcached was out of the question for the complexity and “realtime” nature of this data). So we setup a selective replication fanout onto multiple boxes and pointed these apps at that cluster for these queries.

Well the next day, without all of these tiny queries pounding on it, the database server had horrible problems. 400 threads stacked up inside InnoDB “Waiting for InnoDB queue”. System resources were fine, but it was clear, InnoDB was having trouble. Queries that normally take 0.75 seconds were taking 300+ seconds, or just never completing. I knew there was real trouble, when killing the thread would result in it just changing state to “Killed”, but never dying. Based on what I’d read in High Performance MySQL, and articles like this one, I tried twiddling with innodb_thread_concurrency, innodb_concurrency_tickets, and innodb_thread_sleep_delay. None of them seemed to help, though innodb_thread_concurrency set to a value of about half the CPU cores seemed to delay the problems.

I noticed that we were running MySQL v5.0.51a still. We had planned an upgrade to 5.0.67, which was just recently released, but hadn’t gotten there yet. I went ahead and upgraded one of the boxes to it, and failed over to it. Instantly things were more healthy, and the health seemed to stay for hours, without any more InnoDB freakouts.

After some research, it would seem that between 5.0.51a and 5.0.67, a lot of really big fixes were made to InnoDB to help it scale up on multi-core machines. The box has been healthy for a couple of days, though there’s still a lot of work to do removing query load from the server.

But why would a _reduction_ in queries cause concurrency problems? I have a theory, but no real ideas on how to test it.

Before, we were doing 1000 queries per second. Things were healthy. We removed about 400 queries per second from that. These 400 queries were basically instantaneous.. often times returning no results at all and reading from tables and indexes completely stored in the innodb_buffer_pool. But, with query cache turned off, they were still being processed fully by InnoDB. When we removed these tiny queries from the queue imposed by innodb_thread_concurrency, I think we removed the equivalent of spin waits from the queue. These tiny, easy queries were just hard enough to process, to prevent a lot of bigger queries from hitting the queue at the same time. Thats why reducing innodb_thread_concurrency to 4 helped a bit.. with only 4 threads vying for mutexes and CPU resources constantly, InnoDB was able to (sort of) keep up.

My final bit of evidence for this is that we actually, I think, had this problem before with the aforementioned article. Turning off the query cache moved these tiny queries out of the query cache, and into the InnoDB queue, providing the needed pseudo-spin-waits to prevent it from locking in on itself.

I have to wonder if raising innodb_sync_spin_loops to something ridiculously high, like 50000, would have the same effect. Unfortunately, its very hard to test this without dedicating a lot of time to it.

So, in this case, it would seem that more work can, in fact, make the server healthier.


Aug 12 2008

The new fad: Outsourced Parachute Packing

Holy cow, did you read about this company “The Linkup” losing 45% of its customers’ data?! How about they change their name to “The @$%! Up”.

First off, let me say that these guys didn’t have to be retarded to lose this much data. In fact, there are (were?) probably a lot of really talented people who designed and built this system to avoid such things.

I’m an optimist, so I have to believe somebody raised their voice at a meeting when data was shipped off to some loosely linked company from some past relationship. The finger pointing going on now is exactly what nobody ever wants to see happen to something they built.

Nirvanix says it has not deleted any customer data, and promises that its Storage Delivery Network is immune to the problem that plagued The Linkup. At The Linkup, a “system administrator ran a script that misidentified active account data and disassociated physical files from their owners,” Nirvanix says. “This led to files being marked offline in the old Streamload/MediaMax file system when they shouldn’t have been.” Iverson, meanwhile, claims it was a Nirvanix engineer who caused the data loss.

Hiring managers beware.. if you receive a resume of a Systems Administrator from the San Diego area with a couple of years missing and immediate availability, you might want to ask him if he knows what the -f argument to rm does.

We’ve all been there at the meeting that produces this type of situation though. Somebody finds some amazing 3rd party vendor that will do for the company what now takes a little resources, but will, when the company gets successful, take a lot of resources. Rather than scale our business up, why not just jump onto the coat tails of some other already successful business and rake in the dough, not needing any real technology, just gluing things together.

This is the essence of things like Amazon S3 and EC2, and Google’s AppEngine. These companies have built massive clusters to deal with their biggest days.. and somebody over there got smart and said “hey, 1% of our architecture is 200 times the power of most startups. We should sell it to them.”

The problem is, one day Amazon will need that power, and they won’t think twice about cutting off their tiny little source of revenue to make sure they can sell copies of Oprah’s latest selection. Their core business is selling stuff over the web, so why should they care if EC2 and S3 go slow, stop working, or function better than most people’s multi-thousand-dollar-a-month contracts with ISP’s.

So, if something is key to your success, even if its not “your core competency”, I say build it into your business model to gain that competency. I wouldn’t pay somebody to pack my parachute if I were a sky diver. I’d pay somebody to teach me, maybe even to check it for me, but I want to know that I packed it right when I pull the cord.. I don’t want a little flag to fly out saying “Sorry, We gave up our parachute packing business 2 days ago. Might want to start praying..”