Paginating with Riak

The question of pagination comes up from time to time on the Riak mailing list and in #riak on irc.freenode.net, most recently a few days ago. In reply, I always say something along the lines of "No. Riak does not do pagination." Let's take a look at what pagination is and why Riak has a hard time doing it. Pagination is generally defined as the ordered numbering of pages in a publication, usually a book. Now let's take that book and make it a Hot 100 list of super cool things that we want to put on a website. As far as we are concerned, pagination is the ability to select a subset of information, in sequence, from a larger set of information.

Let's work with the numbers 1 through 100, in order. We could interest ourselves with the numbers one at a time or, perhaps, 10 at a time. If we were to page through those numbers we would have to know primarily two things: where are we starting and how much do we want. In addition, any meaningful pagination would require the larger set to be sorted. Working with our earlier definition and our example, Riak presents one chief complaint: sorting.
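To make the "where are we starting and how much do we want" idea concrete, here is a minimal sketch in plain JavaScript with no Riak involved. The helper name `page` is made up for this illustration, and it assumes the set is already sorted.

```javascript
// Build the ordered set 1..100 used in the example above.
const numbers = Array.from({ length: 100 }, (_, i) => i + 1);

// Hypothetical helper: return `count` items beginning at zero-based `start`.
// The input must already be sorted for the pages to mean anything.
function page(sorted, start, count) {
  return sorted.slice(start, start + count);
}

console.log(page(numbers, 0, 10));  // the first ten numbers
console.log(page(numbers, 20, 10)); // ten numbers starting after the 20th
```

The two arguments are exactly the two things named above: a starting point and an amount. Everything hard about doing this in Riak comes from the "already sorted" assumption.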

Riak at its core is a distributed key/value persisted data store that also happens to do a lot of other things. Now break that down. Looking at those words individually we have "distributed", meaning that your data lives on a number of different machines in your cluster. Good thing, right? Yes. However, it also means that no single machine is the canonical reference for all your data, which in turn means that you need to ask multiple machines for your data and those machines will return data to you when they see fit, i.e. not in order. Moving on, we have "key/value". With regard to the topic at hand, this means that Riak has no insight into any data stored under your keys, i.e. Riak does not care if your stored json object has an age value in it. Next, we have "persisted". Riak has no native internal index, meaning Riak will not store on disk the data you send it in any useful way - useful to you at least. Lastly, we have "happens to do a lot of other things." Thankfully for us, one of those other things is Map/Reduce.

Map/Reduce is where all those previous sorting problems, uh, sort themselves out. Map/Reduce is the way for you to marshal your data in Riak: it takes your unsorted heaping mess of data and whips it into shape. I'll be using the riak-js module for nodejs to talk to Riak and walk through our example. Using a stock Riak install we will populate 100 keys, named 1 through 100, with a simple json object and select a subset of those keys using m/r. This example expounds on a brief mention of the subject in a Basho blog post from the summer of 2010. We will be taking advantage of a number of built-in javascript functions that come bundled with Riak. See you after the code.
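As a rough illustration of the shape of that job, here is a local sketch that runs entirely in node with no Riak connection. It is not the populate-riak or paginate-riak script and it does not use riak-js or Riak's built-ins: the in-memory `bucket`, the `{ id: n }` document shape, and every function name are assumptions made for this example.

```javascript
// Stand-in for a bucket: keys "1".."100", each holding a small JSON document.
// The { id: n } shape is an assumption made for this sketch.
const bucket = new Map();
for (let n = 1; n <= 100; n++) {
  bucket.set(String(n), JSON.stringify({ id: n }));
}

// Map phase: decode every stored value. A cluster hands results back in no
// particular order, which is imitated here by reversing the list.
function mapPhase(store) {
  return Array.from(store.values()).map((raw) => JSON.parse(raw)).reverse();
}

// Reduce phase 1: sort the collected documents by id, ascending.
function reduceSortById(docs) {
  return docs.slice().sort((a, b) => a.id - b.id);
}

// Reduce phase 2: keep only the requested window [start, end).
function reduceWindow(docs, start, end) {
  return docs.slice(start, end);
}

// Chain the phases the way a map/reduce job chains them.
function paginate(store, start, end) {
  return reduceWindow(reduceSortById(mapPhase(store)), start, end);
}

console.log(paginate(bucket, 10, 20).map((doc) => doc.id));
```

Note that `mapPhase` touches every record before a single page comes back, which is the cost discussed next.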

Run populate-riak to populate a bucket, then run paginate-riak to get a "page" of that data back. All this works off the command line. Cool, right? Well, ya. Except... if you are contemplating running a site with any meaningful scale this will not function that well for you. Hmm, why is that? Well, on its face this method will work, but what it is doing is pulling all records in your bucket and sorting your result set on the fly - every time you call it. This will fall down as the number of records in your system grows, i.e. as the number of records you need to sort grows. You really need to employ caching at different layers of your application to make this work better. Allowing your users to run the above every time they want to paginate a set of records is just a recipe for disaster. As an ad-hoc query run once in a while it should work fine, e.g. run on a schedule to build a paginated cache that your user-facing application hits directly.
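One way to picture the paginated cache idea is the sketch below. It is only an in-memory illustration under stated assumptions: `buildPageCache` and `getPage` are hypothetical names, and the `records` array stands in for whatever the expensive query returns.

```javascript
// Hypothetical helper: sort once, then split the result into fixed-size pages.
function buildPageCache(records, pageSize, compare) {
  const sorted = records.slice().sort(compare);
  const pages = [];
  for (let i = 0; i < sorted.length; i += pageSize) {
    pages.push(sorted.slice(i, i + pageSize));
  }
  return pages;
}

// Stand-in for the result of the expensive query: ids 100 down to 1.
const records = Array.from({ length: 100 }, (_, i) => ({ id: 100 - i }));

// Rebuild this on a schedule rather than on every user request.
const pageCache = buildPageCache(records, 10, (a, b) => a.id - b.id);

// User-facing reads become a plain array lookup (pages are numbered from 1).
function getPage(pageNumber) {
  return pageCache[pageNumber - 1] || [];
}

console.log(getPage(2).map((r) => r.id));
```

The sort cost is paid once per rebuild instead of once per page view. A real application would likely keep the pages somewhere shared between processes rather than in a local array.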

Bear in mind that this is not a knock on Riak. It is simply a limitation that is inherent in the design of Riak. When evaluating a persistent data store you should take into account the good, of which Riak has a fair amount, and the not so good. This is just one area where your application will have to accommodate the shortcomings by making judicious use of pre-emptive caching. Now, when asked in the future whether or not Riak supports pagination I'll simply give a qualified "Sort of."

Scammers are a scammin!

Wow, I got full on involved with a scammer for the first time just earlier this morning. So I get an instant message on facebook from a friend of mine, let's call him Yoni. What's up Yoni? How goes... bla bla. Right? No, not today. Today someone decided to hijack Yoni's facebook account and try scamming his friends. Well, color me impressed! Instantly I realize this is a scam so I try to put "Yoni" to a test. Obviously sideswiped. So let's play on, player.... I am posting the logs, host lookup, whois lookup and a screenshot or two. We'll see where this goes, I just started talking to the scammer again. This is really a lot of fun.

Obviously Yoni's entire setup is compromised. Not only his Facebook but also his Yahoo email. Either he is using the same password for multiple sites (tsk tsk) or his machine got so rooted that they are streaming keystrokes back to the mother ship. Either way he is pwned in a major way. Yes, I already left him and his girlfriend a voice mail and sent them texts, so if either of you are reading this, change all your passwords pronto. Also, reformat your computers.

Read on for some classic interweb scammin comedy and a followup from our clueless scammer. I actually got him to take a look at this post. This moron was actually waiting for me to send him the confirmation code. Honestly, who falls for this stuff?

I tell you, this is some of the most fun I've had online in quite some time. In the immortal words of the great Kanye West:

...

Let's have a toast for the douchebags,
Let's have a toast for the assholes,
Let's have a toast for the scumbags

...

I couldn't agree with you more, Kanye. Three cheers. Or in the immortal words of my dear scammer: 

idiot u

 

Security pro tip:

-Never share passwords between multiple accounts.

-Don't use windows (flame on)

-Don't use IE (flame on)

-Don't expect to never get hacked, it may very well happen to you (or me)

Using Riak's map/reduce for sorting

From a database perspective, Riak is a schemaless, key/value datastore. The focus of this post is to show you how to do the equivalent of the SQL "ORDER BY date DESC" using Riak's map/reduce interface. Due to its schemaless, document-focused nature, Riak lacks internal indexing and, by extension, native sorting capabilities. Additionally, Riak does not have a single file backend. The default backend is called Bitcask, but Riak does offer a number of different backends for specific use cases. This makes an internal general purpose index implementation impractical, especially so once you factor in the distributed nature of the platform.

So how does a sort actually work in this environment? Map/Reduce. Riak implements map/reduce as its way of querying the Riak cluster. Let's keep this description light and simply say: Riak brings your query (for the most part) to the node where your data lives. The map part of your query is distributed about the cluster to the nodes where the data resides and executed there, and the results are then sent back to the originating node for the reduce phase. You can write your map/reduce query in two different languages - Erlang and JavaScript (SpiderMonkey is the internal JavaScript engine).

So now that you have a basic theoretical underpinning, how does this actually work in practice? I'm including here a snippet of a heavily commented javascript function that I use in one of my nodejs apps. The bridge between nodejs and Riak is a module called riak-js (disclosure: I've contributed some patches). Let's take a look, I'll see you on the other side.

Let's break this down. This function is part of a larger nodejs application that uses the fu router library lifted from node_chat, a quite approachable getting-to-know-node example application. No, you cannot cut and paste this code somewhere and have it work. What you should do is take a look at the map and reduceDescending variables (lines 15 and 40). Those functions are written in javascript and sent over the wire to Riak. Let's go over some of the magic that makes this work.

Riak will gladly accept a bucket as its input mechanism in a map/reduce. Although Basho has done a good amount of work to make this performant, simply passing a bucket will force an expensive list:keys operation internally. The more keys you have in your system the longer this will take. Sometimes this is unavoidable or even desirable. Most likely you will want to expressly pass keys to the map/reduce job. This is done in the format:

[ ["bucket","key1"],["bucket","key2"],["bucket","key3"],["bucket","key4"] ] 
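If your application already knows which keys it cares about, building that list is a one-liner. The helper name `toInputs` is made up for this sketch, and how the list is handed to your client library is left out on purpose.

```javascript
// Hypothetical helper: pair every key with its bucket, producing the
// [ ["bucket","key1"], ["bucket","key2"], ... ] shape shown above.
function toInputs(bucket, keys) {
  return keys.map((key) => [bucket, key]);
}

const inputs = toInputs("bucket", ["key1", "key2", "key3", "key4"]);
console.log(JSON.stringify(inputs));
```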

Now, although I'm passing the keys here in order (key1... keyN), recall that Riak has no internal concept of ordering. The map phase will seek out the keys wherever they live and the result is not guaranteed to be ordered. What is needed is to sort the result set in the reduce phase once all the data has been collected. In this case I will be sorting by the X-Riak-Last-Modified header, which is a date kept in the format "Tue, 31 Aug 2010 06:46:02 GMT". Well, that doesn't look like a sortable string, does it? The trick is to turn it into an int, as I do on line 28:

o.lastModifiedParsed = Date.parse(v["values"][0]["metadata"]["X-Riak-Last-Modified"]); 

Here the string date is pulled out of the header and converted via the native javascript function Date.parse() into an int. It is the int that allows the numeric sorting in the reduce phase on line 46:
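You can try the conversion on its own in node. This small sketch uses the example header value from above plus a second, made-up timestamp one day earlier. Strictly speaking Date.parse() hands back a JavaScript number holding milliseconds since the Unix epoch, which subtracts and compares just like an int.

```javascript
// The example header value from the text.
const header = "Tue, 31 Aug 2010 06:46:02 GMT";
const parsed = Date.parse(header);

console.log(typeof parsed);                   // "number"
console.log(new Date(parsed).toISOString());  // 2010-08-31T06:46:02.000Z

// A made-up timestamp exactly one day earlier, to show that plain
// subtraction now orders the two values.
const earlier = Date.parse("Mon, 30 Aug 2010 06:46:02 GMT");
console.log(parsed - earlier);                // difference in milliseconds
```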

v.sort ( function(a,b) { return b['lastModifiedParsed'] - a['lastModifiedParsed'] } );

The format "b-a" is what dictates descending order; conversely, ascending order would be written as "a-b". Remember the value is embedded within a javascript object and needs to be accessed as such. This trick can be used with any integer value embedded in a json object. If my "key" (on line 30) were an int I could use that, or maybe a price or quantity value.
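Here is a stripped-down sketch of the two pieces working together, run locally in node rather than inside Riak. It is not the function from my app: `fakeRiakObject` and the sample keys and dates are invented for the illustration, and only the `values[0].metadata["X-Riak-Last-Modified"]` lookup and the sort comparators are taken from the lines quoted above.

```javascript
// Map: turn one Riak object into a small record carrying a numeric timestamp.
function map(v) {
  var o = { key: v.key };
  o.lastModifiedParsed = Date.parse(v["values"][0]["metadata"]["X-Riak-Last-Modified"]);
  return [o];
}

// Reduce: "b - a" sorts newest first, "a - b" sorts oldest first.
function reduceDescending(v) {
  v.sort(function (a, b) { return b["lastModifiedParsed"] - a["lastModifiedParsed"]; });
  return v;
}
function reduceAscending(v) {
  v.sort(function (a, b) { return a["lastModifiedParsed"] - b["lastModifiedParsed"]; });
  return v;
}

// Hypothetical stand-in for what a map function is handed, trimmed to the
// fields this sketch reads.
function fakeRiakObject(key, lastModified) {
  return { key: key, values: [{ metadata: { "X-Riak-Last-Modified": lastModified } }] };
}

var objects = [
  fakeRiakObject("a", "Mon, 30 Aug 2010 06:46:02 GMT"),
  fakeRiakObject("b", "Tue, 31 Aug 2010 06:46:02 GMT"),
  fakeRiakObject("c", "Sun, 29 Aug 2010 06:46:02 GMT")
];

// Run the map over every object and flatten the one-element lists.
var mapped = [].concat.apply([], objects.map(map));

console.log(reduceDescending(mapped.slice()).map(function (o) { return o.key; }));
console.log(reduceAscending(mapped.slice()).map(function (o) { return o.key; }));
```

Swapping `lastModifiedParsed` for a price or quantity field in the comparator gives you the same ordering trick on any numeric value.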

Map/reduce is a bit tricky to wrap your mind around when coming from a relational/sql background, but the new breed of NoSQL databases available makes it easy to duplicate many of those features. Riak exposes a fully functional map/reduce implementation to get at all the nested parts of your complex json documents. So what are you waiting for? Get codin!