Constructor
new BasicPreprocessor(splitOn, stopWords, punctuation) → {this}
Creates a new basic preprocessor.
Parameters:
Name | Type | Description |
---|---|---|
splitOn | RegExp | What to split terms on. Default is any whitespace. |
stopWords | array | A list of common words to be skipped. Default is `[]`. |
punctuation | RegExp | Any punctuation to be removed. Default is all common symbols on an English QWERTY keyboard. |
Returns:
- Type: this
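As a rough illustration of the three parameters and their documented defaults, here is a minimal stand-in. The class name `SketchPreprocessor` and both regular expressions are assumptions made for this example; the reference above does not show the library's actual default patterns.

```javascript
// Hypothetical stand-in, NOT the library's source. It only mirrors the
// documented parameter order and the documented default behaviour.
class SketchPreprocessor {
  constructor(splitOn, stopWords, punctuation) {
    // Assumed pattern for "any whitespace".
    this.splitOn = splitOn || /\s+/;
    // Documented default: an empty stop-word list.
    this.stopWords = stopWords || [];
    // Assumed pattern for "common symbols": the ASCII punctuation ranges
    // ! to /, : to @, [ to `, and { to ~.
    this.punctuation = punctuation || /[!-\/:-@\[-`{-~]/g;
  }
}

// All defaults.
const defaults = new SketchPreprocessor();

// Custom settings: also split on commas, skip two stop words, and strip
// only sentence-ending punctuation.
const custom = new SketchPreprocessor(/[\s,]+/, ['the', 'a'], /[.!?]/g);
```

Passing a narrower `punctuation` pattern, as in `custom`, would leave characters such as apostrophes and hyphens inside terms.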
Methods
appendTerm(terms, currentWord, wordOffset) → {array}
Potentially appends a new term to the terms list.
This is largely an _internal_ method.
This lowercases, then cleans the word. If there are any characters left
post-cleaning, it will create a new `TermPosition`, and append it to
the `terms` list **IN-PLACE**.
Parameters:
Name | Type | Description |
---|---|---|
terms | array | The existing term list. |
currentWord | string | The word to be added. |
wordOffset | int | The offset of the word within the document. |
Returns:
- Type: array
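The lowercase, clean, then append-in-place behaviour described above can be sketched as follows. This is not the library's implementation: `appendTermSketch`, the `SKETCH_PUNCTUATION` pattern, and the `{ term, position }` shape used in place of `TermPosition` are all assumptions for illustration.

```javascript
// Assumed punctuation pattern; the real default is not shown in this reference.
const SKETCH_PUNCTUATION = /[.,!?;:'"()]/g;

// Hypothetical sketch of the append step. `TermPosition` is modeled as a
// plain { term, position } object, which is an assumption.
function appendTermSketch(terms, currentWord, wordOffset) {
  // Lowercase first, then strip punctuation, as the description states.
  const cleaned = currentWord.toLowerCase().replace(SKETCH_PUNCTUATION, '');
  // Append only when some characters survive cleaning.
  if (cleaned.length > 0) {
    terms.push({ term: cleaned, position: wordOffset });
  }
  // The caller's array was mutated; the same instance is returned.
  return terms;
}

const terms = [];
const returned = appendTermSketch(terms, 'Hello,', 0);
appendTermSketch(terms, '!!!', 7); // nothing left after cleaning, so no append
console.log(returned === terms, terms.length); // true 1
```

Because the list is modified in place, callers that need the original list unchanged would have to copy it first.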
clean(word) → {string}
Cleans a word.
Currently, this only strips out basic punctuation characters.
Parameters:
Name | Type | Description |
---|---|---|
word | string | The word to be cleaned. |
Returns:
- Type: string
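Punctuation stripping of this kind usually comes down to a single global regex replace. The sketch below assumes that approach; `cleanSketch` and the pattern passed to it are hypothetical names chosen for this example, not part of the documented API.

```javascript
// Hypothetical sketch: remove every character matched by `punctuation`.
// The pattern must carry the `g` flag so that all matches are removed.
function cleanSketch(word, punctuation) {
  return word.replace(punctuation, '');
}

// Assumed pattern covering a handful of common symbols.
const basicSymbols = /[.,!?;:'"()\-]/g;

console.log(cleanSketch("don't!", basicSymbols));  // dont
console.log(cleanSketch('(hello)', basicSymbols)); // hello
console.log(cleanSketch('---', basicSymbols) === ''); // true
```

A word made up entirely of punctuation cleans to an empty string, which is the case `appendTerm` is documented to skip.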
process(doc) → {array}
Processes a document into a list of terms (`TermPosition` objects).
Parameters:
Name | Type | Description |
---|---|---|
doc | string | The text to preprocess for the engine. |
Returns:
- Type: array
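To make the overall flow concrete, here is a self-contained sketch of a whitespace-splitting preprocessor. It rests on several assumptions not confirmed by this reference: that the offset is a character offset into the document, that stop words are checked after cleaning, and that a `TermPosition` can be modeled as `{ term, position }`. The function name `processSketch` and its punctuation pattern are also hypothetical.

```javascript
// Hypothetical sketch of document preprocessing, NOT the library's code.
// It hard-codes whitespace splitting instead of accepting a `splitOn` pattern.
function processSketch(doc, stopWords = []) {
  const punctuation = /[.,!?;:'"()]/g; // assumed punctuation pattern
  const terms = [];
  // Each match is a run of non-whitespace characters; `match.index` is its
  // character offset within `doc` (assumed meaning of "offset").
  for (const match of doc.matchAll(/\S+/g)) {
    const term = match[0].toLowerCase().replace(punctuation, '');
    // Skip words that are empty after cleaning, and stop words
    // (assumed to be compared against the cleaned, lowercased form).
    if (term.length === 0 || stopWords.includes(term)) {
      continue;
    }
    terms.push({ term, position: match.index });
  }
  return terms;
}

const result = processSketch('The quick, brown fox!', ['the']);
console.log(result);
// [ { term: 'quick', position: 4 },
//   { term: 'brown', position: 11 },
//   { term: 'fox', position: 17 } ]
```

Note that positions refer to the original text, so they stay valid for highlighting even though "The" was dropped as a stop word.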