Peterbe.com

A blog and website by Peter Bengtsson


Here's a little interesting story about using MongoKit to quickly draw items from a MongoDB

So I had a piece of code that was doing a big batch update. It was slow. It took about 0.5 seconds per user and I sometimes had a lot of users to run it for.

The code looked something like this:

for play in db.PlayedQuestion.find({'user.$id': user._id}):
    if play.winner == user:
        bla()
    elif play.draw:
        ble()
    else:
        blu()

Because the model PlayedQuestion contains DBRefs, MongoKit will automatically look them up on every iteration of the main loop. Each lookup is individually very fast (thank you, indexes) but because of the sheer number of operations the total is very slow. Here's how to make it much faster:

for play in db.PlayedQuestion.collection.find({'user.$id': user._id}):

The problem with this is that you get plain dict instances for each result, which are more awkward to work with. I.e. instead of `play.winner` you have to use `play['winner'].id`. Here's my solution that makes this a lot easier:

class dict_plus(dict):

    def __init__(self, *args, **kwargs):
        if 'collection' in kwargs:  # excess we don't need
            kwargs.pop('collection')
        dict.__init__(self, *args, **kwargs)
        self._wrap_internal_dicts()

    def _wrap_internal_dicts(self):
        for key, value in self.items():
            if isinstance(value, dict):
                self[key] = dict_plus(value)

    def __getattr__(self, key):
        if key.startswith('__'):
            raise AttributeError(key)
        return self[key]

...

for play in db.PlayedQuestion.collection.find({'user.$id': user._id}):
    play = dict_plus(play)
    if play.winner.id == user._id:
        bla()
    elif play.draw:
        ble()
    else:
        blu()

Now, the whole thing takes 0.01 seconds instead of 0.5. 50 times faster!!
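In case it helps to see the wrapper in isolation, here's a self-contained sketch of the same idea. The document data is made up, and the nested reference is simulated with a plain sub-dict (a real raw pymongo document would hold a DBRef instance there):

```python
# The dict_plus wrapper from the post: it wraps nested dicts so that
# keys can be read as attributes, which reads more like model instances.

class dict_plus(dict):

    def __init__(self, *args, **kwargs):
        kwargs.pop('collection', None)  # excess we don't need
        dict.__init__(self, *args, **kwargs)
        self._wrap_internal_dicts()

    def _wrap_internal_dicts(self):
        # Recursively wrap sub-documents so attribute access works
        # all the way down.
        for key, value in self.items():
            if isinstance(value, dict):
                self[key] = dict_plus(value)

    def __getattr__(self, key):
        if key.startswith('__'):
            raise AttributeError(key)
        return self[key]


# A made-up raw document, as a plain dict query might return it
raw = {'winner': {'id': 'abc123'}, 'draw': False}
play = dict_plus(raw)
print(play.winner.id)  # instead of play['winner']['id']
print(play.draw)
```

Note that a missing key raises KeyError rather than AttributeError here, which is fine for this quick-and-dirty use but worth knowing if you reuse the trick.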

Because this took me a long time to figure out I thought I'd share it with people in case other people get stuck on the same problem.

The problem is that Mongoose doesn't support DBRefs. A DBRef is just a little sub-structure with two keys: $ref and $id, where $id is an ObjectId instance. Here's what it might look like in the mongodb shell:

> db.questions.findOne();
{
       "_id" : ObjectId("4d64322a6da68156b8000001"),
       "author" : {
               "$ref" : "users",
               "$id" : ObjectId("4d584fb86da681668b000000")
       },
       "text" : "Foo?",
       ...
       "answer" : "Bar",
       "genre" : {
               "$ref" : "question_genres",
               "$id" : ObjectId("4d64322a6da68156b8000000")
       }
}
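To make the mechanics concrete, here's a pure-Python sketch of what resolving a DBRef amounts to. The "database" is just nested dicts standing in for collections, and the ids are made up; a real driver does the same lookup against the actual collection:

```python
# A DBRef is just an embedded document: $ref names the collection,
# $id identifies the document in it. Resolving one is a plain lookup.

database = {
    'users': {
        '4d584fb86da681668b000000': {'first_name': 'Peter'},
    },
}

question = {
    'text': 'Foo?',
    'author': {'$ref': 'users', '$id': '4d584fb86da681668b000000'},
}

def dereference(db, dbref):
    # Find the collection named by $ref, then the document named by $id
    return db[dbref['$ref']][dbref['$id']]

author = dereference(database, question['author'])
print(author['first_name'])  # Peter
```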

DBRefs are very convenient because various wrappers on drivers can do automatic cross-fetching based on this. For example, with MongoKit I can do this:

for question in db.Question.find():
   print question.author.first_name

If we didn't have DBRefs you'd have to do this:

for question in db.Question.find():
   author = db.Authors.find_one({'_id': question.author})
   print author.first_name

Anyway, the problem Mongoose has is that it doesn't support DBRefs so when you define its structure you have to do this:

var QuestionSchema = new mongoose.Schema({
  text     : String
  , answer      : String
  ...
  , genre : {}
  , author : {}
});
mongoose.model('Question', QuestionSchema);
var Question = mongoose.model('Question', 'questions');

Now, that sucks but I can learn to live with it. Giving this schema here's how you can work with the defined model from an existing database:

Question.find({}, function(err, docs) {
  docs.forEach(function(each) {
     each.findAuthor(function(err, item) {
        console.log(item.doc.first_name);
     });     
  });
});

(note: I have no idea what this "doc" struct is but perhaps the gods of Mongoose can explain that one)

So, given that I can't work with DBRefs in Mongoose, how do you mock it? Here's how:

var author = new models.Author();
author.first_name = "peter";
author.save(function(err) {
    var question = new models.Question();
    // fake a dbref
    question.author = {
       "$ref" : "users",
       "oid" : author._id
    };
    question.save(function(err) {
       question.findAuthor(function(err, u) {
          assert.ok(u);
          assert.ok(!err);
          test.equal(u.doc.first_name, 'peter');
          test.done();
       });
    });
});

That's the magic. This way you can pretend, in your tests, that you have objects with proper DBRefs. It feels strangely convoluted and hackish but I'm sure once I've understood this better there might be a better way.

Perhaps, instead of spending the time it took to figure this out and write this blog post, I could have stepped up and written a DBRef plugin for Mongoose.

MongoUK 2011 - London conference all about MongoDB

This is a summary of the MongoUK conference held today here in London. It was great! Unlike other commercial conferences this one actually cost money, but it was only about $50, and although there were free coffee cups, stickers and stuff it didn't feel at all like they were trying to sell themselves. The focus was on the technology. Great!

As often with these things, you realise that you have actually grasped a couple of things quite well by now, but it's also humbling that there is so much you haven't grasped and that other people are way ahead of you.

What I learned

1) DBRefs aren't anything special. A DBRef is just an embedded document. The difference from saving a plain ObjectId from another collection is that with a DBRef you also get some information about which collection it belongs to. So, if your "foreign reference" is called user you don't really need to remind yourself (and your disk space and your index space) that it belongs to the collection users. Having said that, I think ORMs (e.g. MongoKit) will do a better job of automatically wrapping these references if you use DBRefs.

2) ensureIndex() is not cheap, even if it doesn't necessarily change your structure. Don't take it lightly. Do it as an administration task and treat it as such. Get it right, because it is really important, and the name "ensure" might sound more harmless than it actually is if the collection is huge.

3) Regular expressions can be indexed if they're "range-able" (my own word). For example, if you index a field called username you can use your index if you do a query on /^pet/. Also, on the subject of indexing, if you have a compound index (e.g. ensureIndex({x: 1, y: 1})) you can use the index if you just do a query on x. But not on y. Did I understand that correctly, Richard?
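To see why an anchored regex like /^pet/ is "range-able": the index keeps values sorted, so "starts with pet" is exactly the range ['pet', 'peu'), which a B-tree can scan directly. A little pure-Python sketch, with a sorted list standing in for the index (usernames made up):

```python
import bisect

# A sorted list of indexed values, standing in for a B-tree index
index = sorted(['anna', 'pete', 'peter', 'petra', 'paul', 'zack'])

def prefix_range_scan(sorted_values, prefix):
    # First value >= the prefix itself
    lo = bisect.bisect_left(sorted_values, prefix)
    # Smallest string greater than every string with this prefix:
    # bump the last character of the prefix by one.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_values, upper)
    return sorted_values[lo:hi]

print(prefix_range_scan(index, 'pet'))  # ['pete', 'peter', 'petra']
```

An unanchored regex like /pet/ can't be turned into a range like this, which is why it can't use the index the same way.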

Lastly, regarding indexes: the indexing engine is very smart and polymorphic. If you ensure an index on an array it will index the elements of the array, not just the whole array as one entity.
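This "multikey" behaviour can be sketched in plain Python: each element of the array field gets its own index entry pointing back at the document (the documents and ids here are made up):

```python
# Fake documents with an array field, keyed by made-up document ids
docs = {
    1: {'tags': ['mongodb', 'python']},
    2: {'tags': ['python']},
}

# Build the index: one entry per array *element*, not per array
index = {}
for doc_id, doc in docs.items():
    for tag in doc['tags']:
        index.setdefault(tag, set()).add(doc_id)

print(index['python'])   # both documents match
print(index['mongodb'])  # only the first one
```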

4) Elliot Horowitz is a smart cookie

5) Journaling is cool but if your traffic is high, sharding and replication are more important. The addition of the journaling feature might be slightly more PR related, but I like it because some setups are so small that I don't need 5+1 replica sets, and then I really want and need journaling. To paraphrase Elliot: "Due to disks' inherent nature, corruption will always happen".

6) 10gen is trying to attract more enterprise use by writing case studies but also by fulfilling enterprise-driven feature requests. Encryption is one such feature currently being worked on.

7) Map/Reduce is currently "one of the biggest annoyances of MongoDB" but this is being genuinely worked on. It currently works and is very flexible but it's clunky and can be slow due to Javascript being slow. What's being done is that all standard aggregating operators (e.g. sum()) will soon (2.0 release, hopefully) become part of the core and be done in C++. The other thing they're actively looking into is changing the Javascript engine to V8.

8) Triggers in Javascript are a very frequent feature request, but they would be really quite hard to implement (and get right) and would potentially be slow due to depending on the Javascript engine. It's on the roadmap but very far down. I'm happy for it to be introduced in 2027.

9) MongoDB has really thought about sharding. It's not something that has been thrown in for good measure. The plumbing (including config servers, load balancers and slaves) is basically all taken care of by MongoDB. The only challenging thing for you (the developer) to think about is what key to shard on.

10) A company called Server Density has a really good looking application for monitoring MongoDB. It works by you installing a daemon on your MongoDB cluster and looking at the aggregated stats and stuff in their application running on their own servers. David Mytton's presentation is best digested in these 6 blog articles

11) Version 2.0 is going to be another big step forward for MongoDB. In particular it's going to improve on concurrency, which understandably is a hard problem to solve, but now that MongoDB is maturing it's the right time to attack it.

Last but not least, I need to do something about the way I look or smell or something! I chatted to about 10-15 people and not a single one of them walked up to me. I had to start every single interaction. Am I that intimidating or are people perhaps not interested in talking to strangers?

MongoKit is a Python wrapper on top of pymongo that adds structure, validation and some other bits and pieces. It's like an ORM, but for a document store like MongoDB instead of an SQL database. It's a great piece of code because it's thin. It's very careful not to mollycoddle you or block your access to the source. What I discovered was that I was doing an advanced query and the results were instantiated as class instances and later turned into JSON for the HTTP response. Bad idea. I don't really need them to be objects, so with MongoKit it's possible to go straight to the source and that's what I did.

With few very simple changes I managed to make my restful API app almost 10 times faster!!

Read the whole story here

At Euro DjangoCon I met lots of people and talked a lot about MongoDB as the backend. I even did a presentation on the subject which led to a lot of people asking me more questions about MongoDB.

I did mention to some people that one of the drawbacks of using MongoDB, which doesn't have transactions, is that you have to create and destroy the collections (like SQL tables) for every single test run. I thought this was slow. It's not.

Today I've been doing some more profiling and testing and debugging and I can conclude that it's not a problem. Creating the database has a slight delay but it's something you only have to do once and actually it's very fast. Here's how I tear down the collections in between each test:

class BaseTest(TestCase):

    def tearDown(self):
        for name in self.database.collection_names():
            if name not in ('system.indexes',):
                self.database.drop_collection(name)

For example, running test of one of my apps looks like this:

$ ./manage.py test myapp
...........lots.............
----------------------------------------------------------------------
Ran 55 tests in 3.024s

So, don't fear writing lots of individual unit tests. MongoDB will not slow you down.