Secs sell! How I cache my entire pages (server-side)
10 May 2012
1 comment
Python, Django
http://www.peterbe.com/stats/
I've blogged before about how this site can easily push out over 2,000 requests/second using only 6 WSGI workers, excluding latency. That's possible because whole pages can be cached server-side: the entire rendered HTML blob is stored in the cache server (Redis in my case), so no database queries are needed at all.
I wanted my site to still "feel" dynamic in the sense that once you post a comment (and it's published), the cache is invalidated automatically, so users don't have to force-refresh their browser when they know the page should have changed. To accomplish this I used a hacked cache_page decorator that makes the cache key depend on the content the page depends on. Here's the code I actually use today for the home page:
def _home_key_prefixer(request):
    if request.method != 'GET':
        return None
    prefix = urllib.urlencode(request.GET)
    cache_key = 'latest_comment_add_date'
    latest_date = cache.get(cache_key)
    if latest_date is None:
        # when a blog comment is posted, the blog modify_date is incremented
        latest, = (BlogItem.objects
                   .order_by('-modify_date')
                   .values('modify_date')[:1])
        latest_date = latest['modify_date'].strftime('%f')
        cache.set(cache_key, latest_date, 60 * 60)
    prefix += str(latest_date)
    try:
        redis_increment('homepage:hits', request)
    except Exception:
        logging.error('Unable to redis.zincrby', exc_info=True)
    return prefix

@cache_page_with_prefix(60 * 60, _home_key_prefixer)
def home(request, oc=None):
    ...
    try:
        redis_increment('homepage:misses', request)
    except Exception:
        logging.error('Unable to redis.zincrby', exc_info=True)
    ...
And in the models I then have this:
@receiver(post_save, sender=BlogComment)
@receiver(post_save, sender=BlogItem)
def invalidate_latest_comment_add_dates(sender, instance, **kwargs):
    cache_key = 'latest_comment_add_date'
    cache.delete(cache_key)
So this means:
- whole pages are cached for a long time for fast access
- updates immediately invalidate the cache for the best user experience
- no need to mess with ANY SQL caching
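The cache_page_with_prefix decorator itself isn't shown above. As a rough sketch of the idea only (not the actual decorator used on this site, and not Django's API), here is a framework-free version. A plain dict stands in for Redis, `state` stands in for the database, and `post_comment` plays the role of the post_save receiver; all of those names are made up for illustration, and time-outs are left out.

```python
import functools

cache = {}                                   # stand-in for the cache server
state = {'latest_comment': 1, 'renders': 0}  # stand-in for the database


def cache_page_with_prefix(key_prefixer):
    """Cache a view's output under a key that includes key_prefixer's result."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(path):
            prefix = key_prefixer(path)
            if prefix is None:
                # the prefixer opted out, so don't cache at all
                return view(path)
            key = 'page:%s:%s' % (path, prefix)
            if key not in cache:
                cache[key] = view(path)
            return cache[key]
        return wrapper
    return decorator


def home_key_prefixer(path):
    stamp = cache.get('latest_comment_add_date')
    if stamp is None:
        # stands in for the "latest modify_date" SQL query
        stamp = str(state['latest_comment'])
        cache['latest_comment_add_date'] = stamp
    return stamp


@cache_page_with_prefix(home_key_prefixer)
def home(path):
    state['renders'] += 1  # counts how often the page is really rendered
    return '<html>comments: %d</html>' % state['latest_comment']


def post_comment():
    state['latest_comment'] += 1
    # same job as the post_save receiver: forget the stamp
    cache.pop('latest_comment_add_date', None)
```

Calling `home('/')` twice renders once. After `post_comment()` the stamp is recomputed, the cache key changes, and the next call renders again. The old page entry is simply never asked for again, which is why a real cache needs a time-out to expire it.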
So, the next question is, if posting a comment means that the cache is invalidated and needs to be populated, what's the ratio of hits versus hits where the cache is cleared? Glad you asked. That's why I made this page:
It lets me monitor how often a new blog comment or a general time-out means poor Django has to re-create the HTML using SQL.
At the time of writing, one in every 25 hits to the homepage requires the server to re-generate the page. And still the content is always fresh and relevant.
The next level of optimization would be to figure out whether a particular page update (e.g. a blog comment posting on a page that isn't featured on the home page) should or should not invalidate the home page. esp
String length truncation optimization difference in Python
19 March 2012
8 comments
Python
We have a piece of code that is going to be run A LOT on a server infrastructure that needs to be fast. I know that I/O matters much more, but because I had the time I wanted to figure out which of these is faster:
def a(s, m):
    if len(s) > m:
        s = s[:m]
    return s
...or...
def b(s, m):
    return s[:m]
I wrote a simple benchmark that bombarded these two string manipulation functions with strings that each had a 50% chance of being longer than the max length. In other words, half of the strings sent to these two functions were so short they didn't need to be truncated.
Turns out, there is absolutely no significant difference! I'm not even going to post the timings.
I could go on and repeat the iterations till my CPU starts to smoke, but by then I'm sure the benchmark would be invalid and in need of re-engineering, and the realm of the test would become surreal. Now, carry on with your life of writing real code.
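For anyone who wants to repeat the comparison, a benchmark along these lines would do. This is a sketch rather than the original benchmark script: the max length of 50, the 1,000 test strings and the 200 repetitions are arbitrary choices. No timings are listed because they depend on the machine.

```python
import random
import timeit


def a(s, m):
    if len(s) > m:
        s = s[:m]
    return s


def b(s, m):
    return s[:m]


MAX_LENGTH = 50  # arbitrary choice for this sketch
random.seed(0)   # fixed seed so every run tests the same strings
# lengths are uniform in 1..100, so about half exceed MAX_LENGTH
strings = ['x' * random.randint(1, 2 * MAX_LENGTH) for _ in range(1000)]


def run(func):
    for s in strings:
        func(s, MAX_LENGTH)


for func in (a, b):
    seconds = timeit.timeit(lambda: run(func), number=200)
    print('%s: %.4f seconds' % (func.__name__, seconds))
```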
When to __deepcopy__ classes in Python
14 March 2012
9 comments
Python
When using mutables in Python you have to be careful:
>>> a = {'value': 1}
>>> b = a
>>> a['value'] = 2
>>> b
{'value': 2}
So, you use the copy module from the standard library:
>>> import copy
>>> a = {'value': 1}
>>> b = copy.copy(a)
>>> a['value'] = 2
>>> b
{'value': 1}
That's nice but it's limited. It doesn't deal with the nested mutables as you can see here:
>>> a = {'value': {'name': 'Something'}}
>>> b = copy.copy(a)
>>> a['value']['name'] = 'else'
>>> b
{'value': {'name': 'else'}}
That's when you need the copy.deepcopy function:
>>> a = {'value': {'name': 'Something'}}
>>> b = copy.deepcopy(a)
>>> a['value']['name'] = 'else'
>>> b
{'value': {'name': 'Something'}}
Now, suppose we have a custom class that subclasses the dict type. That's a very common thing to do. Let's demonstrate:
>>> class ORM(dict):
...     pass
...
>>> a = ORM(name='Value')
>>> b = copy.copy(a)
>>> a['name'] = 'Other'
>>> b
{'name': 'Value'}
And again, if you have a nested mutable object you need copy.deepcopy:
>>> class ORM(dict):
...     pass
...
>>> a = ORM(data={'name': 'Something'})
>>> b = copy.deepcopy(a)
>>> a['data']['name'] = 'else'
>>> b
{'data': {'name': 'Something'}}
But oftentimes you'll want to make your dict subclass behave like a regular class so you can access data with dot notation. Like this:
>>> class ORM(dict):
...     def __getattr__(self, key):
...         return self[key]
...
>>> a = ORM(data={'name': 'Something'})
>>> a.data['name']
'Something'
Now here's a problem. If you do that, you lose the ability to use copy.deepcopy since the class has now been slightly "abused".
>>> a = ORM(data={'name': 'Something'})
>>> a.data['name']
'Something'
>>> b = copy.deepcopy(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/copy.py", line 172, in deepcopy
    copier = getattr(x, "__deepcopy__", None)
  File "<stdin>", line 3, in __getattr__
KeyError: '__deepcopy__'
Hmm... now you're in trouble and to get yourself out of it you have to define a __deepcopy__ method as well. Let's just do it:
>>> class ORM(dict):
...     def __getattr__(self, key):
...         return self[key]
...     def __deepcopy__(self, memo):
...         return ORM(copy.deepcopy(dict(self)))
...
>>> a = ORM(data={'name': 'Something'})
>>> a.data['name']
'Something'
>>> b = copy.deepcopy(a)
>>> a.data['name'] = 'else'
>>> b
{'data': {'name': 'Something'}}
Yeah!!! Now we get what we want. Messing around with the __getattr__ like this is, as far as I know, the only time you have to go in and write your own __deepcopy__ method.
I'm sure hardcore Python language experts can point out lots of intricacies about __deepcopy__ but since I only learned about this today, having it here might help someone else too.
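One such intricacy shows in the traceback above. copy.deepcopy looks for the hook with getattr(x, "__deepcopy__", None), and that default is only used when the lookup raises AttributeError. The __getattr__ above raises KeyError instead, so the error escapes. An alternative to writing __deepcopy__, sketched here as one possible fix rather than the only one, is to turn the KeyError into an AttributeError:

```python
import copy


class ORM(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            # A missing key has to look like a missing attribute, so that
            # getattr(x, '__deepcopy__', None) can fall back to its default.
            raise AttributeError(key)


a = ORM(data={'name': 'Something'})
b = copy.deepcopy(a)  # no custom __deepcopy__ needed
a.data['name'] = 'else'
print(b)  # {'data': {'name': 'Something'}}
```

With this version the copy is also still an ORM instance, so dot notation keeps working on `b`.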
Persistent caching with fire-and-forget updates
14 December 2011
4 comments
Python, Tornado
I just recently landed some patches on toocool that implement an interesting pattern that is seen more and more these days. I call it: persistent caching with fire-and-forget updates.
Basically, the implementation is this: You issue a request that requires information about a Twitter user: E.g. http://toocoolfor.me/following/chucknorris/vs/peterbe
The app looks in its MongoDB for information about the tweeter, and if it can't find this user it goes to the Twitter REST API, looks it up and saves the result in MongoDB.
The next time the same information is requested and the data is available in MongoDB, it instead checks whether the modify_date is more than an hour old and, if so, sends a job to the message queue (Celery with Redis in my case) to perform an update on this tweeter.
You can basically see the code here but just to reiterate and abbreviate, it looks like this:
tweeter = self.db.Tweeter.find_one({'username': username})
if not tweeter:
    result = yield tornado.gen.Task(...)
    if result:
        tweeter = self.save_tweeter_user(result)
    else:
        # deal with the error!
elif age(tweeter['modify_date']) > 3600:
    tasks.refresh_user_info.delay(username, ...)
# render the template!
What the client gets, i.e. the user using the site, is instant results every time apart from the very first time that URL is requested, while the data is still being maintained and refreshed.
This pattern works great for data that doesn't have to be up-to-date to the second but that still needs a way to invalidate the cache and re-fetch. This works because my limit of 1 hour is quite arbitrary. An alternative implementation would be something like this:
tweeter = self.db.Tweeter.find_one({'username': username})
if not tweeter or (tweeter and age(tweeter) > 3600 * 24 * 7):
    # re-fetch from Twitter REST API
elif age(tweeter) > 3600:
    # fire-and-forget update
That way you don't suffer from persistently cached data that is too old.
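To make the two thresholds concrete, here is a self-contained sketch of the pattern with no MongoDB, Celery or Twitter involved. Everything in it is a stand-in invented for illustration: `store` is a dict instead of the database, `refresh_queue` is a list instead of the message queue, and `fetch_from_api` returns dummy data instead of calling the REST API.

```python
import time

HOUR = 3600
WEEK = HOUR * 24 * 7

store = {}          # stand-in for the MongoDB collection
refresh_queue = []  # stand-in for the message queue


def fetch_from_api(username):
    # hypothetical placeholder for the slow remote lookup
    return {'username': username, 'followers': 0}


def save(username, info, now):
    record = dict(info, modify_date=now)
    store[username] = record
    return record


def get_tweeter(username, now=None):
    now = time.time() if now is None else now
    record = store.get(username)
    if record is None or now - record['modify_date'] > WEEK:
        # nothing cached, or far too old: fetch while the user waits
        return save(username, fetch_from_api(username), now)
    if now - record['modify_date'] > HOUR:
        # stale but usable: queue a refresh and serve what we have
        refresh_queue.append(username)
    return record


def run_worker(now=None):
    # what the background worker would do with the queued jobs
    now = time.time() if now is None else now
    while refresh_queue:
        username = refresh_queue.pop(0)
        save(username, fetch_from_api(username), now)
```

The `now` parameter exists only so the time-based branches can be exercised without waiting an hour.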
Python file with closing automatically
03 December 2011
2 comments
Python
Perhaps someone who knows more about the internals of Python and the recent changes in 2.6 and 2.7 can answer this question that came up today in a code review.
I suggest using with instead of try: ... finally: to close a file that was written to. Instead of this:
dest = file('foo', 'w')
try:
    dest.write('stuff')
finally:
    dest.close()
print open('foo').read()  # will print 'stuff'
We can use this:
with file('foo', 'w') as dest:
    dest.write('stuff')
print open('foo').read()  # will print 'stuff'
Why does that work? I'm guessing it's because the file() instance object has a built in __exit__ method. Is that right?
That means I don't need to use contextlib.closing(thing) right?
For example, suppose you have this class:
class Farm(object):
    def __enter__(self):
        print "Entering"
        return self
    def __exit__(self, err_type, err_val, err_tb):
        print "Exiting", err_type
        self.close()
    def close(self):
        print "Closing"

with Farm() as farm:
    pass
# this will print:
# Entering
# Exiting None
# Closing
Another way to achieve the same specific result would be to use the closing() helper from contextlib (strictly speaking it's a context manager wrapper, not a decorator):
class Farm(object):
    def close(self):
        print "Closing"

from contextlib import closing
with closing(Farm()) as farm:
    pass
# this will print:
# Closing
So the closing() helper "steals" the __enter__ and __exit__. This last one can be handy if you do this:
from contextlib import closing
with closing(Farm()) as farm:
    raise ValueError
# this will print
# Closing
# Traceback (most recent call last):
#   File "dummy.py", line 16, in <module>
#     raise ValueError
# ValueError
This is turning into me thinking out loud, and I think I get it now. contextlib.closing() basically makes it possible to do what I did there with the __enter__ and __exit__, and it seems the file() built-in has an exit handler that takes care of the closing already, so you don't have to wrap it in anything extra.
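To spell out what closing() does for you, here is a hand-written equivalent next to the real thing. `MyClosing` is a name made up for this sketch, and it is only an approximation of contextlib.closing, not its actual source. The `closed` flag on Farm is added here so the result can be checked without relying on print output.

```python
from contextlib import closing


class MyClosing(object):
    """Rough hand-written equivalent of contextlib.closing."""

    def __init__(self, thing):
        self.thing = thing

    def __enter__(self):
        return self.thing

    def __exit__(self, err_type, err_val, err_tb):
        self.thing.close()
        # returning None (falsy) lets any exception carry on propagating


class Farm(object):
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# the hand-written wrapper closes on a normal exit...
with MyClosing(Farm()) as farm1:
    pass

# ...and the real closing() also closes when the block raises
farm2 = Farm()
try:
    with closing(farm2):
        raise ValueError
except ValueError:
    pass
```

Both `farm1.closed` and `farm2.closed` end up True. File objects define __enter__ and __exit__ themselves, which is why they work in a with statement without any wrapper.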
Trivial but powerful tips for nosetests
19 November 2011
0 comments
Python
I'm clearly still a nosetests beginner because it was only today that I figured out how to set certain plugins to always be on.
First of all you might like these plugins too:
$ pip install rudolf
$ pip install disabledoc
Docs: rudolf and disabledoc
To get these gorgeous little tricks into every run of nosetests edit the file ~/.noserc and add the following:
[nosetests]
with-disable-docstring=1
with-color=1
That should make your life a little easier.
UPDATE:
I've since managed to shoot myself in both legs with messing around with nosetests plugins because I heavily rely on django-nose in Django. Long story short: be careful if you get strange import related errors!
Slides about Kwissle from yesterday's London Python Dojo
08 July 2011
0 comments
Python
/plog/slides-about-kwissle-lpdojo/slides.html
Here are the slides from yesterday's London Python Dojo event.
I presented and demo'ed Kwissle to my fellow Python London friends and focused a lot on the technology but also tried to plug the game a bit.
Having seen that there's a lot of interest in "socket"-related web applications, I thought this was a good chance to say that you don't need NodeJS and that tornadio is a great framework for that.
Optimization story involving something silly I call "dict+"
13 June 2011
0 comments
Python, MongoDB
https://gist.github.com/1021777
Here's an interesting little story about using MongoKit to quickly draw items from a MongoDB.
So I had a piece of code that was doing a big batch update. It was slow. It took about 0.5 seconds per user and I sometimes had a lot of users to run it for.
The code looked something like this:
for play in db.PlayedQuestion.find({'user.$id': user._id}):
    if play.winner == user:
        bla()
    elif play.draw:
        ble()
    else:
        blu()
Because the model PlayedQuestion contains DBRefs, MongoKit will automatically look them up for every iteration in the main loop. Each lookup is very fast individually (thank you, indexes) but the sheer number of operations makes it very slow in total. Here's how to make it much faster:
for play in db.PlayedQuestion.collection.find({'user.$id': user._id}):
The problem with this is that you get plain dict instances for each item, which are more awkward to work with. I.e. instead of `play.winner` you have to use `play['winner'].id`. Here's my solution that makes this a lot easier:
class dict_plus(dict):
    def __init__(self, *args, **kwargs):
        if 'collection' in kwargs:  # excess we don't need
            kwargs.pop('collection')
        dict.__init__(self, *args, **kwargs)
        self._wrap_internal_dicts()
    def _wrap_internal_dicts(self):
        for key, value in self.items():
            if isinstance(value, dict):
                self[key] = dict_plus(value)
    def __getattr__(self, key):
        if key.startswith('__'):
            raise AttributeError(key)
        return self[key]
...
for play in db.PlayedQuestion.collection.find({'user.$id': user._id}):
    play = dict_plus(play)
    if play.winner.id == user._id:
        bla()
    elif play.draw:
        ble()
    else:
        blu()
Now, the whole thing takes 0.01 seconds instead of 0.5. 50 times faster!!
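To try the wrapper without a database, it can be fed an ordinary nested dict. The sketch below repeats the class (lightly tidied) and uses made-up sample data. The `answers` key is not from the real model; it is there to show one limitation: only dicts that are direct values get wrapped, so dicts sitting inside a list stay plain dicts.

```python
class dict_plus(dict):
    def __init__(self, *args, **kwargs):
        kwargs.pop('collection', None)  # excess we don't need
        dict.__init__(self, *args, **kwargs)
        self._wrap_internal_dicts()

    def _wrap_internal_dicts(self):
        # list() so the dict is not being iterated while it is modified
        for key, value in list(self.items()):
            if isinstance(value, dict):
                self[key] = dict_plus(value)

    def __getattr__(self, key):
        if key.startswith('__'):
            raise AttributeError(key)
        return self[key]


# made-up sample document, shaped roughly like a raw query result
raw = {'winner': {'id': 42}, 'draw': False, 'answers': [{'text': 'yes'}]}
play = dict_plus(raw)

winner_id = play.winner.id         # nested dicts get dot access too
first_answer = play.answers[0]     # a plain dict: lists are not descended into
answer_text = first_answer['text'] # so this needs normal key access
```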
TornadoGists.org - launched and ready!
06 April 2011
1 comment
Python, Tornado
http://tornadogists.org/
Today Felinx Lee and I launched TornadoGists.org, which is a site for discussing gists related to Tornado (the Python web framework open sourced by Facebook).
Everyone in the Tornado community seems to solve similar problems in different ways. Oftentimes, these solutions are just a couple of lines or so and not something you can really turn into a full package with setup.py and everything.
Sharing a snippet of code is a great way to a) help other people and b) to get feedback on your solutions.
The goal is to make it a very open and active project with lots of contributors. I'll be accepting and reviewing all forks but hopefully control will be opened up to all Tornado developers. Also, since the code is quite generic to any open source project Felinx and I might one day port this to rubygists.org or lispgists.org or something like that. After all, Github does all the heavy lifting and we just wrap it up nicely.
More productive than Lisp? Really??!
10 March 2011
0 comments
Python
Erann Gat reveals why he lost his mojo with Lisp
What caught my attention (for busy people who don't want to read the whole email):
I'm currently studying Lisp myself and it's hard. Really hard. I blame it on being spoiled with a programming language that I can work in without having to read the manual. With Python's brilliant introspection I can use the interpreter to find out how a library works just by using help() and dir() without even having to read the source code. (not always true of course)
As we're entering the 21st century, the new contender "Usability" is becoming more and more important. Considering that I've now done Python for more than a decade, I remind myself of one of the reasons I liked it so much; yes, exactly that: Usability.