Using Riak's map/reduce for sorting

From a database perspective, Riak is a schemaless, key/value datastore. The focus of this post is to show you how to do the equivalent of the sql "SORT BY date DESC" using Riak's map/reduce interface. Due to Riak's schemaless, document focused nature Riak lacks internal indexing and by extension, native sorting capabilities. Additionally, Riak does not have a single file backend. The primary default backend is called Bitcask but Riak does offer a number of different backends for specific use cases. This makes an internal general purpose index implementation impractical, especially so once you factor in the distributed nature of the platform.

So how does a sort actually work in this environment? Map/Reduce. Riak implements map/reduce as its way of querying the riak cluster. Lets keep this description light and simply say: Riak brings your query (for the most part) to the node where your data lives. The map part of your query is distributed about the cluster to the nodes where the data resides, executed, then results sent back to the originating node for the reduce phase. You can write your map/reduce query in two different languages - erlang and javascript (Spidermonkey is the internal JavaScript engine.)

So now that you have a basic theoretical underpinning, how does this actually work in practice? I'm including here a snippet of a heavily commented javascript function that i use in one of my nodejs apps. The bridge between nodejs and Riak is a module called riak-js (disclosure, I've contributed some patches.) Let's take a look, I'll see you on the other side.

Lets break this down. This function is part of a larger nodejs application that uses the fu router library lifted from node_chat, a quite approchable getting-to-know-node example application. No you can not cut and paste this code somewhere and have it work. What you should do is take a look at the map and reduceDescending variables (lines 15 and 40). Those functions are written in javascript and sent over the wire to riak. Lets go over some of the magic that makes this work.

Riak will gladly accept a bucket as it's input mechanism in a map/reduce. Although Basho has done a good amount of work to make this performant, simply passing a bucket will force an expensive list:keys operation internally. The more keys you have in your system the longer this will take. Sometimes this is unavoidable or even desirable. Most likely you will want to expressly pass keys to the map/reduce job. This is done in the format:

[ ["bucket","key1"],["bucket","key2"],["bucket","key3"],["bucket","key4"] ] 

Now, although I'm passing the keys here in order (key1... keyN), recall that riak has no internal concept of ordering. The map phase will seek out the keys wherever they live and the result is not guaranteed to be ordered. What is needed is to sort the result set in the reduce phase once all the data has been collected. In this case I will be sorting by the X-Riak-Last-Modified header which is a date kept in the format "Tue, 31 Aug 2010 06:46:02 GMT". Well, that doesn't look like a sortable string, does it? The trick is to turn it into an int, as I do on line 28:

o.lastModifiedParsed = Date.parse(v["values"][0]["metadata"]["X-Riak-Last-Modified"]); 

Here the string date is pulled out of the header and converted via the native javascript function Date.parse() into an int. It is the int that allows the numeric sorting in the reduce phase on line 46:

v.sort ( function(a,b) { return b['lastModifiedParsed'] - a['lastModifiedParsed'] } );

The format "b-a" is what dictates descending order, conversely ascending order would be written as "a-b". Remember the value is embedded within a javascript object and needs to be accessed as such. This trick can be used with any integer value embedded in a json object. If my "key" (on line 30) were an int I could use that, or maybe a price or quantity value.

Map/reduce is a bit tricky to wrap you mind around when coming from a relational/sql background but the new breed of NoSQL databases available make it easy to duplicate many of those features. Riak exposes a fully functional map/reduce implementation to get at all the nested parts of your complex json documents. So what are you waiting for? Get codin!