So I was starting to play with Memcached for session storage, and I found a fairly big problem with just using memcached in its normal caching mode as a session store. It really boils down to this: caching data and reliably storing deterministic data are very different things that only look similar on the surface.

So normally, memcached is used in a very clever way: you add a list of servers, then use a hashing algorithm to pick which server to actually contact based on the key of a get/set request. This allows a ton of scaling out, with minimal moving parts. There's no periodic monitor or broadcast protocol to add and remove cluster members to and from pools, so you can just run memcached on a bunch of servers and use a consistent server list across all of your machines to achieve a huge degree of scale out. When a server dies, the code just sees that, moves on to the next one in the hash algorithm, and all is well.
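To make that concrete, here's a toy sketch of the idea. This is not the PHP Memcache module's actual hashing (it does its own internally); the server list and the crc32-based pick are just for illustration.

```php
<?php
// Toy illustration: every client shares the same server list, and the key
// alone decides which server to talk to. No coordination between clients.

$servers = array('10.0.0.1:11211', '10.0.0.2:11211', '10.0.0.3:11211');

function pick_server(array $servers, $key)
{
    // Same arithmetic on every web head, so every box picks the same
    // server for a given key.
    $index = abs(crc32($key)) % count($servers);
    return $servers[$index];
}

echo pick_server($servers, 'sess:abc123'); // same answer on every machine
```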

For caching, this "failover" methodology works fine. If I go to set a value in memcached and the write fails over to the second server, that's OK. The next get against the primary will miss, the value will be regenerated and set properly, and the old entry on the secondary will *eventually* get pushed out of the cache.

However, for storing data reliably, this becomes a problem. Let's say a network cable is bad on one of the memcached servers, and 1 in 100 requests fails. With caching, failover will go a little nuts, but it's entirely possible nobody will even notice: results will still be cached, data won't get stale... no big deal.

With storage, though, this could happen:

- session is created on memcache1

- session tries to read from memcache1 and fails, so a new session is created on memcache2

- session is then read from memcache1

- session is updated on memcache1 with new information

- session fails to read from memcache1, so the old session data is read from memcache2; the subsequent set succeeds on memcache1, and the newer data from the previous step is lost

The point isn't really the details of this scenario, but that this hashing/failover scheme isn't just vulnerable to losing data that was written to it; it is designed to. That is the caching paradigm.

As I discussed this with some colleagues, my mind immediately jumped to MemcacheDB. Maybe that would work for session storage. It has replication, so we could use the traditional active/passive paradigm for it. However, this limits our scale to whatever a single instance of MemcacheDB can handle. Honestly, that's probably fine for most sites, as MemcacheDB can probably handle tens of thousands of small writes per second.

However, there are multiple problems. The biggest problem with MemcacheDB is that there's no easy way (yet; they're working on it) to pull keys out of it to do garbage collection. And anyway, session data really doesn't need to live for a long time. We just need to be reasonably certain that the data we're getting is reasonably new.

If we store the data on *all* of the servers, and alongside it store a highly accurate timestamp of when the data was given to us (highly accurate meaning that if it takes you milliseconds to complete a request, the timestamp needs to be down to microseconds; and we use the same timestamp for every server), we can then just read it from all of the servers and pick the newest one. Ew, though: since every write goes to every server, that means we are still limited to the scale of one instance of memcached.
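Here's a rough sketch of what I mean, assuming the PHP Memcache extension; the key naming, the 1800-second expiry, and the helper function names are made up for illustration, not a finished session handler.

```php
<?php
// Rough sketch of "write everywhere with one timestamp, read everywhere
// and pick the newest".

function session_write_all(array $servers, $sessionId, $data)
{
    // One microsecond-resolution timestamp, shared by every copy of this write.
    $stamp = microtime(true);
    foreach ($servers as $host) {
        $mc = new Memcache();
        if (@$mc->connect($host, 11211)) {
            // Store the timestamp alongside the data so readers can compare copies.
            $mc->set('sess:' . $sessionId, array('ts' => $stamp, 'data' => $data), 0, 1800);
        }
    }
    return $stamp;
}

function session_read_newest(array $servers, $sessionId)
{
    // Returns the newest array('ts' => ..., 'data' => ...) envelope, or false.
    $best = false;
    foreach ($servers as $host) {
        $mc = new Memcache();
        if (!@$mc->connect($host, 11211)) {
            continue; // dead or flaky server, just move on
        }
        $copy = $mc->get('sess:' . $sessionId);
        if (is_array($copy) && ($best === false || $copy['ts'] > $best['ts'])) {
            $best = $copy;
        }
    }
    return $best;
}
```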

Then I had a flashback to the way MogileFS works. It stores data on a number of replica servers, and of course it also keeps track of where it stored them. But I figured that, for sessions, that's a lot of overhead. There's an easier way: we can use the consistent hashing algorithm that the PHP Memcache module uses to pick servers, and just read and write the data to the first nReplicas servers for the key. If a server fails, we'll move on to the next one, and with consistent hashing there's a reasonable degree of certainty that the next one will be the same server every time. If we write stale data to a server and then fail back to it later, we're protected by the timestamp rules. The higher nReplicas, the higher the reliability that a server failure won't cause issues. I even found a PHP implementation of consistent hashing called FlexiHash.
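FlexiHash handles the "first nReplicas servers for this key" part. Something like the sketch below, assuming FlexiHash's lookupList() behaves as its documentation describes; the include path and server names are placeholders.

```php
<?php
// Sketch: use consistent hashing to pick the nReplicas servers that hold a
// given session, then reuse the write-everywhere/read-newest helpers above
// against just that subset instead of the full pool.
require_once 'flexihash.php';

$nReplicas = 3;

$hash = new Flexihash();
$hash->addTargets(array('10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4', '10.0.0.5'));

// The same session id always maps to the same ordered list of servers, on
// every web head, with no coordination.
$replicas = $hash->lookupList('sess:' . session_id(), $nReplicas);

session_write_all($replicas, session_id(), $_SESSION);
```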

There's one last issue that bugs me about using memcached for sessions, and the timestamp helps us solve it. We recently found a problem where a request would take, say, 45 seconds to complete. At 20 seconds, the user would hit the back button out of frustration. This would put other stuff in the session, then the 45-second request would complete and write the version of the session it thinks is right to the session store, losing the user's new activity.

There are two ways to solve this. One is to introduce locking. This actually isn't hard to do with memcached; it is described in the memcached FAQ. However, this introduces something to block or fail on in the read path. I think it's simpler than that: you simply read the record again before you write it, and if it has changed since you read it the first time, you don't write it. You just throw the session write away. Obviously the user has moved on, so there's no reason to make your update. If you used locking, the user would still be waiting on the old thread to finish.
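A sketch of that rule, building on the helpers above; $tsAtRead is the timestamp we saw when the request first loaded the session, and all of the names are illustrative.

```php
<?php
// Sketch of "throw the write away if someone else got there first".

function session_write_if_unchanged(array $replicas, $sessionId, $data, $tsAtRead)
{
    // Re-read just before writing and compare against the timestamp we saw
    // when this request originally loaded the session.
    $current = session_read_newest($replicas, $sessionId);
    if ($current !== false && $current['ts'] > $tsAtRead) {
        // A newer write landed while this request was running (say, from the
        // impatient back-button request). The user has moved on; drop ours.
        return false;
    }
    session_write_all($replicas, $sessionId, $data);
    return true;
}
```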

Of course, this all hinges on you caring that your session data is accurate, and that users don't lose their sessions when one server goes down. If neither of those apply to you, then you can just use sessions like cache.