Spell checker

pocketsearch provides a simple implementation of a SpellChecker that can be used to correct misspelled tokens in a query. By default spell checking is turned off. In order to support spellchecking, you have to to setup your schema as follows:

Enabling spell checking

from pocketsearch import Schema, Text

class Example(Schema):

    class Meta:
        spell_check = True

    title = Text(index=True) # spellchecked
    body = Text(index=True) # spellchecked
    category = Text() # not spellchecked, as it is not part of the fulltext-search index

A spell checking dicitionary is built based on fields where index is set to True.

import pocketsearch

with pocketsearch.PocketWriter(schema=Example,db_name="my_db.db") as pocket_writer:
    pocket_writer.insert(title="Hello",body="World",category="Default")
    # Build spell checking dictionary:
    pocketwriter.spell_checker().build()

The spell checking dictionary is always built from scratch (thus any previous entries are deleted) when the .build method is invoked.

Using spellchecking

We can now open the search index again and use spelling suggestions:

import pocketsearch
# now the database is written and the spellchecker is available
with pocketsearch.PocketReader(schema=Example,db_name="my_db.db") as pocket_reader:
    pocket_reader.suggest("hllo") 

.suggest returns a dictionary with possible corrections sorted by their edit distance to the token in the database:

pocket_reader.suggest("hllo")
{'hllo': [('hello', 1)]}
pocket_reader.suggest("wrld") 
{'wrld': [('world', 1)]}
pocket_reader.suggest("hllo wrld")
{'hllo': [('hello', 1)],'wrld': [('world', 1)]}

Spellchecking is done as follows:

  • A separate pocketsearch instance is built in the background

  • The token table of the original pocketsearch instance is scanned and tokens are divided into bigrams

  • Bigrams are stored in the spellchecker dictionary

  • .suggest tokenizes the query and splits each token into bigrams

  • . suggest will then search the bigrams order them by rank and additionally calculate the Levensthein distance retrieving top 10 suggestions.