Spellcorrector
18th of April 2007
I think a lot of Python people have seen Peter Norvig's beautiful article, How to Write a Spelling Corrector. So have I, and I couldn't wait to write my own little version of it to fit my needs.
The changes I made were:
- Python 2.4 compatible
- Uses a pickleable dict instead of a collections.defaultdict
- Compiled a huge list of Swedish words
- Skips edit distance 2 for words longer than 10 characters
- Added a function suggestions()
- Uses Unicode strings throughout
- A class instead of a function
- Ability to train on your own words and to save that training persistently
If you're still reading at this point, you're quite likely a coder, so you'd rather see code that shows how it works:
>>> from spellcorrector import Spellcorrector
>>> sc = Spellcorrector('en')
>>> sc.correct('caracter')
u'character'
>>> sc.correct(u'caracter')
u'character'
>>> sc.suggestions(u'caracter')
[u'character']
>>> sc.suggestions(u'spell')
[u'smell', u'shell', u'sell', u'spell', u'swell', u'spill', u'spells']
>>> sc.suggestions(u'spel')
[u'spell', u'sped']
>>> sc.suggestions(u'spel', detailed=True)
[{'count': 9, 'percentage': 90.0, 'word': u'spell'}, \
{'count': 1, 'percentage': 10.0, 'word': u'sped'}]
>>> # Physics database usage example
...
>>> sc.correct('Planck')
u'black'
>>> sc.correct('Curie')
u'sure'
>>> sc.train(['Planck','Curie','Einstein','Heisenberg'])
>>> sc.correct('Planck')
u'planck'
>>> sc.correct('curie')
u'curie'
>>> sc.save('Physicist_words.txt')
>>> del sc
>>> file('Physicist_words.txt').read()
'planck\ncurie\neinstein\nheisenberg'
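For the curious, here is a minimal sketch of how a corrector along these lines could be put together. It is not the code from the download: the names SketchCorrector, ALPHABET and MAX_LEN_FOR_EDITS2 are made up for illustration, it handles only the plain English alphabet, and it is written for a current Python rather than 2.4. The ranking is simply "most frequently trained word first", which is an assumption about how suggestions might be ordered.

```python
import pickle

ALPHABET = u"abcdefghijklmnopqrstuvwxyz"  # assumption: English letters only
MAX_LEN_FOR_EDITS2 = 10  # longer words skip the expensive distance-2 search


class SketchCorrector(object):
    """Hypothetical stand-in, not the real Spellcorrector class."""

    def __init__(self, words=()):
        # A plain dict of word -> count can be pickled as-is.
        self.counts = {}
        self.train(words)

    def train(self, words):
        """Count each word, case-insensitively."""
        for word in words:
            word = word.lower()
            self.counts[word] = self.counts.get(word, 0) + 1

    def _edits1(self, word):
        """All strings one delete, transpose, replace or insert away."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
        inserts = [a + c + b for a, b in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)

    def _known(self, words):
        return set(w for w in words if w in self.counts)

    def suggestions(self, word):
        """Known words near `word`, most frequently trained first."""
        word = word.lower()
        found = self._known(self._edits1(word) | set([word]))
        if not found and len(word) <= MAX_LEN_FOR_EDITS2:
            # Distance 2 is only tried for short words.
            found = self._known(
                e2 for e1 in self._edits1(word) for e2 in self._edits1(e1)
            )
        return sorted(found, key=lambda w: (-self.counts[w], w))

    def correct(self, word):
        """Return the word itself if known, else the best suggestion."""
        word = word.lower()
        if word in self.counts:
            return word
        ranked = self.suggestions(word)
        return ranked[0] if ranked else word

    def dumps(self):
        """Serialize the trained counts with pickle."""
        return pickle.dumps(self.counts)
```

Keeping the counts in an ordinary dict is what makes the last method trivial: a defaultdict built around a lambda cannot be pickled, while a dict of strings and integers can.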
A lot more can probably be done to improve it, but it works quite well as a foundation for an application that mimics Google's "Did you mean: ..." feature.
I've actually already implemented this on a search feature of a not-yet-launched website for art. Since the art site contains non-English names like "Corneille", "Doucet" or "Belartio" I had to train my spellcorrector for that particular application so that a perfectly fine search for "attentif" didn't become "Did you mean: _attentive_".
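A rough sketch of that "Did you mean" step might look like the following. The function did_you_mean and the toy make_toy_correct helper are hypothetical names invented for this example, and the lookup table stands in for a real corrector. The point is only to show that once a site-specific word has been trained, the query passes through without a suggestion.

```python
def did_you_mean(query, correct):
    """Return a suggested query, or None when no word would change.

    `correct` is any callable that maps one word to its correction,
    for example the correct method of a trained corrector object.
    """
    words = query.split()
    fixed = [correct(w) for w in words]
    if [w.lower() for w in words] == [f.lower() for f in fixed]:
        return None  # nothing to suggest
    return u" ".join(fixed)


def make_toy_correct(fixes, trained):
    """Toy stand-in for a corrector: trained words always win."""
    def correct(word):
        key = word.lower()
        if key in trained:
            return key
        return fixes.get(key, key)
    return correct


# Hypothetical data: what an English-only corrector might "fix".
fixes = {u"attentif": u"attentive", u"caracter": u"character"}

untrained = make_toy_correct(fixes, set())
trained = make_toy_correct(fixes, set([u"attentif", u"corneille"]))

print(did_you_mean(u"attentif", untrained))  # suggests a change
print(did_you_mean(u"attentif", trained))    # no suggestion
```

Returning None rather than echoing the query back makes it easy for the search page to decide whether to show the "Did you mean" line at all.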
I'll blog more about that application once I get it up and running on a public domain.
To take this early code experiment for a spin, download:
spellcorrector-0.1.2.tar.bz2 (6.7Mb)
spellcorrector-0.1.4.tar.bz2 (6.7Mb)
spellcorrector-0.1.5.tar.bz2 (6.7Mb)