PythonZopeIssueTrackerProduct

Massive improvement on sorting a fat list

http://www.issuetrackerproduct.com/D...ad#IssueTrackerMassContainer

28th of February 2010

IssueTrackerMassContainer is a simple Zope product that is used to put a bunch of IssueTrackerProduct instances into. It doesn't add much apart from a nice looking dashboard that lists all recent issues and then with an AJAX poll it keeps updating automatically.

But what it was doing was it recursively put together all issues across all issue trackers, sorting them and then returning only the first 20. Fine, but once the numbers start to add up it can become a vast sort operation to deal with.

In my local development copy of 814 issues, by the use of pympler and time() I was able to go from 7 Mb taking 2 seconds down to using only 8 Kb and taking 0.05 seconds.

Here's the initial naive version of the code:

     def getRecentIssues(self, since=None, recursive=True, batch_size=20, batch_start=0):
        """ return a list of all the most recent issues """
        issues = self._getAllIssues(self.getRoot())
        if since is not None:
            ... # checking variable since
            issues = [x for x in issues
                      if float(x.getModifyDate()) > since]

        issues.sort(lambda x,y: cmp(y.getModifyDate(), x.getModifyDate()))
        return issues[int(batch_start):int(batch_size)]

So, instead of making a fat list of issue objects, just turn it into a list of the things we really need. The second version:

     def getRecentIssues(self, since=None, recursive=True, batch_size=20, batch_start=0):
        """ return a list of all the most recent issues """
        root = self.getRoot()
        issues = self._getAllIssues(root)
        issues = [(x.getModifyDate(), '/'.join(x.getPhysicalPath())) for x in issues]
        if since is not None:
            ... # checking variable since
            issues = [(t,i) for (t,i) in issues
                      if float(t) > since]

        issues.sort()
        issues.reverse()
        issue_paths = [x[1] for x in issues[int(batch_start):int(batch_size)]]
        return [root.unrestrictedTraverse(p) for p in issue_paths]

The issue method getModifyDate() returns a Zope DateTime instance which is ridiculously nifty datetime implementation but it sucks in memory use and performance. See this blog about how it sucks compared to mxDateTime and standard lib datetime. So, this time, turn it into a float and then sort. Final version:

     def getRecentIssues(self, since=None, recursive=True, batch_size=20, batch_start=0):
        """ return a list of all the most recent issues """
        root = self.getRoot()
        issues = self._getAllIssues(root)
        issues = [(float(x.getModifyDate()), '/'.join(x.getPhysicalPath())) 
                  for x in issues]
        if since is not None:
            ... # checking variable since
            issues = [(t,i) for (t,i) in issues
                      if t > since]

        issues.sort()
        issues.reverse()
        issue_paths = [x[1] for x in issues[int(batch_start):int(batch_size)]]
        return [root.unrestrictedTraverse(p) for p in issue_paths]

And the results for my local copy of 818 issues?:

 Version 1:
    7842736 bytes (7.5 Mb)
    2.1547999382 seconds

 Version 2:
    834880 bytes (0.79 Mb)
    0.210245847702 seconds

 Version 3:
    87448 bytes (85 Kb)
    0.0538010597229 seconds

Granted, Zope will release this memory by the garbage collector but why even let it get that big if you have concurrent hits or if anything gets stuck for longer than necessary. Python 2.4 can free memory used but not return it to the operating system to reuse unless the process dies (this was fixed in Python 2.5).

That's a memory usage improvement of about 90 fold and a speed improvement of about 40 fold.

Fry-IT has huge issuetracker instances and at the time of writing keeps 6668 "active" issues across all projects.



Comment

Show all 6 comments
 

Commenting is currently disabled in Mobile version