Getting a Random Record From a MongoDB Collection

One of my issues with MongoDB is that, as of this writing, there is no way to retrieve a random record. In SQL, you can simply do something similar to “ORDER BY RAND()” (this varies depending on your flavor) and you can retrieve random records (at a slightly expensive query cost). There is not yet an equivalent in MongoDB because of its sequential access nature. There is a purely Javascript method in the MongoDB cookbook here. If you are really interested, I would also read the Jira ticket thread #533 on this issue.

Although it feels a little dirty and kind of hackish, here is how I accomplished getting a random record using the Mongo-Ruby driver. Part of this is documented in the cookbook article I linked to above, but I reiterate bits and pieces of it here. This is essentially the same thing that any “ORDER BY RAND()” statement is doing, its just not doing it “on the fly”.

The first thing you’ll have to do is add an additional column to the collection; we’ll call it random. For the ease of use, we’ll also say that every value that goes in this column is between 0 and 1 (and can therefore be generated via Kernel.rand()). This is important because we are going to use it as our criteria for finding a random record.

First, initialized the connection to the database and bind an instance variable to a collection. Then generate the random number that you are going to use to find a random record. Now we try to find_one document that is greater than or equal to our random number. In case we miss, we also do a less than or equal to next. This means that as long as we have at least 1 document in our collection, we will return a record. The more documents in the collection, the better the randomness of the returned document.

@@mongodb = Mongo::Connection.new("localhost", 27017).db("test_db}")
@collection = @@mongodb["collection_name"]

@rand = Kernel.rand()
@random_record = @collection.find_one({ 'random' => { '$gte' => @rand } })
if @random_record.nil?
  @random_record = @collection.find_one({ 'random' => { '$lte' => @rand } })
end

For reference, a mongodb collection with a random column may look like this:

{
        "_id" : ObjectId("4c5c710e41b89d657d000001"),
        "url" : "http://www.example.com/",
        "created_at" : "Fri Aug 06 2010 16:32:15 GMT-0400 (EDT)",
        "random" : 0.45929463868260356
}
Posted in MongoDB. Tags: , . 10 Comments »
  • Pingback: Tweets that mention Getting a Random Record From a MongoDB Collection | Erics Tech Blog -- Topsy.com

  • bronson

    I'm not sure I agree with this… Your code makes some records much more likely to be returned than others. True, your random column has a gaussian distribution, but it's still lumpy. Some records will be close to others and some will be far away.

    For example, if the random column has 0.27, 0.31, 0.44, and 0.66, there's a 17% chance that the second number will be returned, and a 54% chance that the last number will be returned.

    I think you'd do better to evenly subdivide the random column: 0.5, 0.25, 0.75. 0.125, 0.375, etc… With an smooth distribution in the random column, each record is equally likely to be returned.

    Of course, until you completely fill a row, the higher entries in the previous row are more likely to be returned. if your database is filling quickly, this should not be a problem. If not, you could check if you hit an unallocated part of the last row and, if so, just call rand() again. Or, you could just each new row completely with null records, then fill them in as they arrive.

  • http://eric.lubow.org Eric Lubow

    Honestly, I am not all that familiar with creating smooth distribution of random numbers. I actually initially considered using a Mersenne Twister algorithm as it my understanding that it is a smoother distribution, but for my purposes, I didn't think it mattered. After looking at what you wrote, I think it may if Kernel.rand() is especially biased.

    However, the Mersenne Twister algorithm (according to wikipedia) seems to be sensitive to poor initialization. Since the random number generator will be reinitialized at the creation/update of each, this could also pose a bias problem (if I understand correctly). I suppose I could initialize once at application start up and then let normal distribution work its course. Is there potentially a better way to achieve this with a steadily growing number of rows?

  • http://twitter.com/scsmith Steve Smith

    How about using a combination of count and offset? I discussed a similar issue when using a traditional RDBMS on my blog. Sure the approach also has limitations when things are changing very quickly however in most cases it will work as well as any other solution. Storing a random number on the record is a better concept for performance than working it out on the fly but it does add disk overhead.

  • Guest

    whats with:
    collection = connection.db(“mydb”).collection(“mycollection”)
    random_entry = collection.find().limit(-1).skip(rand(collection.count())).next_document()

    its never nil 🙂

  • http://www.lovemikeg.com/ Mike G.

    Going off of the previous comment, I was able to come up with this:

    PHP:
    $c->find()->limit( -1 )->skip(mt_rand( 0, $c->count() ))->getNext()

    JavaScript Shell:
    c.find().limit( -1 ).skip( _rand() * c.count() )

    It seems the best solution is to keep it within the skip/rand pattern.

    MG

  • Sprog

    I think I understand… It’s not the random number generation that’s the problem; it’s the numbers that you are storing in the collection. Take the above example – 0.27, 0.31, 0.44, and 0.66.
    The problem is that the distribution of those numbers isn’t even, and in 54% of cases you will end up with a number greater than 0.66 and therefore be returned 0.66. This is in contrast to the 17% chance you will get 0.31.

    If you stored your random value in the database with an even distribution then there is an equal chance of picking any row.

  • Pingback: ec2-consistent-snapshot With Mongo | Revolusionline

  • http://www.facebook.com/olliethebastard Ollie Harridge

    I had some issues with this method. The probability of each document being returned is not equal, so it did not meet my needs. Also a big problem with this method was that the documents were returned in disk order, so some were always returned more often than others. I could get round this by using .sort({‘random’=>1}) but this was defeating the point of only scanning 1 document.  I have left more detailed comments on the cookbook page
    http://cookbook.mongodb.org/patterns/random-attribute/ 

  • David James

    There is an active feature request on MongoDB’s JIRA ticket tracker. If you would like Mongo to support randomized ordering of queries, please check it out and vote for it: https://jira.mongodb.org/browse/SERVER-533