Class: BasicPreprocessor

nanosearch~BasicPreprocessor(splitOn, stopWords, punctuation) → {this}

A basic preprocessor. Takes a document of text and generates a list of words. Functionally, it does the following:

- converts the document to lowercase
- strips out non-alphanumeric symbols
- splits on whitespace
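The three steps above can be sketched on a plain string. This is only an illustration of the described behavior, not the library's code: `roughPreprocess` is a hypothetical name, and it returns bare strings rather than `TermPosition` objects.

```javascript
// Hypothetical stand-in for the behavior described above; not nanosearch's code.
function roughPreprocess(doc) {
  return doc
    .toLowerCase()                      // 1. convert the document to lowercase
    .replace(/[^a-z0-9\s]/g, "")        // 2. strip non-alphanumeric symbols
    .split(/\s+/)                       // 3. split on whitespace
    .filter((word) => word.length > 0); // drop empty strings left at the edges
}

console.log(roughPreprocess("Hello, World! It's 2024."));
// → [ 'hello', 'world', 'its', '2024' ]
```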

Constructor

new BasicPreprocessor(splitOn, stopWords, punctuation) → {this}

Creates a new basic preprocessor.
Parameters:
| Name | Type | Description |
| --- | --- | --- |
| splitOn | RegExp | What to split terms on. Default is any whitespace. |
| stopWords | array | A list of common words to be skipped. Default is `[]`. |
| punctuation | RegExp | Any punctuation to be removed. Default is all common symbols on an English QWERTY keyboard. |
Returns:
Type
this
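As a rough picture of how the three constructor arguments and their defaults fit together, here is a stand-in class with the same parameter order. `SketchPreprocessor` is hypothetical, and the default regular expressions are assumptions chosen to match the descriptions above; the library's actual defaults may differ.

```javascript
// Hypothetical stand-in mirroring the documented signature; not the real class.
class SketchPreprocessor {
  constructor(
    splitOn = /\s+/,          // assumed form of "any whitespace"
    stopWords = [],           // documented default
    punctuation = /[^\w\s]/g  // assumed approximation of "common symbols"
  ) {
    this.splitOn = splitOn;
    this.stopWords = stopWords;
    this.punctuation = punctuation;
  }
}

// All defaults.
const defaults = new SketchPreprocessor();
console.log(defaults.stopWords.length); // → 0

// Custom settings: also split on commas, skip two stop words, strip only .!?
const custom = new SketchPreprocessor(/[\s,]+/, ["the", "a"], /[.!?]/g);
console.log(custom.stopWords); // → [ 'the', 'a' ]
```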

Methods

appendTerm(terms, currentWord, wordOffset) → {array}

Potentially appends a new term to the terms list. This is largely an _internal_ method. This lowercases, then cleans the word. If there are any characters left post-cleaning, it will create a new `TermPosition`, and append it to the `terms` list **IN-PLACE**.
Parameters:
| Name | Type | Description |
| --- | --- | --- |
| terms | array | The existing term list. |
| currentWord | string | The word to be added. |
| wordOffset | int | The offset of the word within the document. |
Returns:
Type
array
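The in-place behavior is easy to miss, so here is a minimal sketch of it. Everything here is an assumption made for illustration: `sketchAppendTerm` is a hypothetical function, the `{ term, position }` object stands in for `TermPosition` (whose real shape is not documented here), and returning the same array that was passed in is a guess based on the `array` return type.

```javascript
// Hypothetical stand-in; the real method lives on BasicPreprocessor.
function sketchAppendTerm(terms, currentWord, wordOffset) {
  // Lowercase first, then clean (here: drop anything that is not a-z or 0-9).
  const cleaned = currentWord.toLowerCase().replace(/[^a-z0-9]/g, "");

  // Only append when something survives cleaning.
  if (cleaned.length > 0) {
    // Assumed TermPosition shape: the term plus its offset in the document.
    terms.push({ term: cleaned, position: wordOffset });
  }
  return terms; // same array instance, mutated in place
}

const termList = [];
const returned = sketchAppendTerm(termList, "Hello!", 0);
sketchAppendTerm(termList, "...", 7); // nothing left after cleaning, so skipped

console.log(termList);              // → [ { term: 'hello', position: 0 } ]
console.log(returned === termList); // → true
```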

clean(word) → {string}

Cleans a word. Currently, this only strips out basic punctuation characters.
Parameters:
| Name | Type | Description |
| --- | --- | --- |
| word | string | The word to be cleaned. |
Returns:
Type
string
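A sketch of what stripping punctuation might look like with a single regular expression. `sketchClean` and `assumedPunctuation` are hypothetical names, and the character class is only a guess at "common symbols on an English QWERTY keyboard". Note that this step does not change case, since the description above mentions only punctuation.

```javascript
// Assumed set of ASCII punctuation; the library's default pattern may differ.
const assumedPunctuation = /[!"#$%&'()*+,\-.\/:;<=>?@[\\\]^_`{|}~]/g;

// Hypothetical stand-in for clean(word).
function sketchClean(word, punctuation = assumedPunctuation) {
  return word.replace(punctuation, "");
}

console.log(sketchClean("don't!"));  // → dont
console.log(sketchClean("(hello)")); // → hello
console.log(sketchClean("C++"));     // → C
```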

process(doc) → {array}

Processes a document into a list of terms (`TermPosition` objects).
Parameters:
| Name | Type | Description |
| --- | --- | --- |
| doc | string | The text to preprocess for the engine. |
Returns:
Type
array
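To show how the pieces could fit together, the sketch below walks a document word by word, records where each word starts, and skips stop words. It rests on several assumptions: `sketchProcess` is a hypothetical function, the offset is taken to be a character index (it could equally be a word index in the real library), and `{ term, position }` again stands in for `TermPosition`.

```javascript
// Hypothetical end-to-end sketch; not nanosearch's implementation.
function sketchProcess(doc, stopWords = []) {
  const found = [];

  // Matching runs of non-whitespace is equivalent to splitting on whitespace,
  // but also gives the starting index of each word via match.index.
  for (const match of doc.matchAll(/\S+/g)) {
    const word = match[0].toLowerCase().replace(/[^a-z0-9]/g, "");

    // Skip words that are empty after cleaning, and any stop words.
    if (word.length === 0 || stopWords.includes(word)) continue;

    found.push({ term: word, position: match.index });
  }
  return found;
}

console.log(sketchProcess("The quick, brown fox!", ["the"]));
// → [
//     { term: 'quick', position: 4 },
//     { term: 'brown', position: 11 },
//     { term: 'fox', position: 17 }
//   ]
```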