resumeanalyser.text_cleaning

Module Contents

Functions

remove_punctuation(text)

Remove punctuation and special characters from the input text.

tokenize(text)

Tokenize the input text into individual words.

to_lower(tokens)

Convert all tokens in the input list to lowercase.

remove_stop_words(tokens)

Remove stop words from the list of tokens.

lemmatize(tokens)

Apply lemmatization to each token in the list.

clean_text(text)

Clean text by applying a series of processing steps: tokenization, converting to lower case, removing stop words, and applying lemmatization.

resumeanalyser.text_cleaning.remove_punctuation(text)

Remove punctuation and special characters from the input text.

Parameters: text (str): A string containing the text to be processed.

Returns: str: The text with all punctuation and special characters removed.

Example:

    >>> remove_punctuation("Hello, world!")
    'Hello world'
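The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the package's actual implementation: the name remove_punctuation_sketch is hypothetical, and it assumes that "special characters" means anything that is neither a word character nor whitespace.

```python
import re


def remove_punctuation_sketch(text: str) -> str:
    """Hypothetical stand-in for remove_punctuation (an assumption, not the package's code)."""
    # \w matches letters, digits and the underscore; \s matches whitespace.
    # Every other character is treated as punctuation and dropped.
    return re.sub(r"[^\w\s]", "", text)


print(remove_punctuation_sketch("Hello, world!"))  # Hello world
```

Note that this pattern keeps underscores and digits, which may or may not match what the real function does.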

resumeanalyser.text_cleaning.tokenize(text)

Tokenize the input text into individual words.

Parameters: text (str): A string containing the text to be tokenized.

Returns: list: A list of words (tokens) extracted from the input text.

Example:

    >>> tokenize("Hello, world!")
    ['Hello', ',', 'world', '!']
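The example output shows punctuation marks returned as separate tokens. This page does not say which tokenizer the function uses, so the sketch below is only a standard-library approximation that reproduces the documented example. The name tokenize_sketch is hypothetical, and the result may differ from the real function on contractions, hyphenated words, and similar input.

```python
import re


def tokenize_sketch(text: str) -> list[str]:
    """Hypothetical regex tokenizer (an assumption, not the package's code)."""
    # Either a run of word characters, or one character that is
    # neither a word character nor whitespace (a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize_sketch("Hello, world!"))  # ['Hello', ',', 'world', '!']
```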

resumeanalyser.text_cleaning.to_lower(tokens)

Convert all tokens in the input list to lowercase.

Parameters: tokens (list): A list of tokens (words).

Returns: list: A list of tokens in lowercase.

Example:

    >>> to_lower(["Hello", "WORLD"])
    ['hello', 'world']

resumeanalyser.text_cleaning.remove_stop_words(tokens)

Remove stop words from the list of tokens.

Parameters: tokens (list): A list of tokens (words).

Returns: list: A list of tokens with stop words removed.

Example:

    >>> remove_stop_words(["this", "is", "a", "sample"])
    ['sample']
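Stop-word removal amounts to filtering tokens against a fixed word list. The sketch below illustrates the idea with a deliberately tiny, hypothetical list. The real function presumably uses a much larger list whose source is not stated on this page, and the names STOP_WORDS and remove_stop_words_sketch are assumptions.

```python
# Tiny illustrative stop-word list (an assumption, not the package's list).
STOP_WORDS = frozenset({"this", "is", "a", "the", "and", "of"})


def remove_stop_words_sketch(tokens, stop_words=STOP_WORDS):
    """Hypothetical stand-in for remove_stop_words."""
    # Set membership keeps each lookup O(1); token order is preserved.
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words_sketch(["this", "is", "a", "sample"]))  # ['sample']
print(remove_stop_words_sketch(["This", "sample"]))  # ['This', 'sample']
```

The second call shows why, in this sketch, the comparison is case-sensitive: a capitalized "This" is not filtered, so tokens would need to be lowercased first.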

resumeanalyser.text_cleaning.lemmatize(tokens)

Apply lemmatization to each token in the list.

Parameters: tokens (list): A list of tokens (words).

Returns: list: A list of lemmatized tokens.

Example:

    >>> lemmatize(["running", "jumps"])
    ['running', 'jump']
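Lemmatization maps each token to its dictionary form and leaves unknown tokens unchanged. A real lemmatizer relies on a full lexical database; the sketch below substitutes a tiny hypothetical lookup table purely to show the shape of the operation. LEMMA_TABLE and lemmatize_sketch are assumptions and are not part of the package.

```python
# Tiny illustrative lookup table (an assumption, standing in for a real lexicon).
LEMMA_TABLE = {"jumps": "jump", "words": "word", "mice": "mouse"}


def lemmatize_sketch(tokens, table=LEMMA_TABLE):
    """Hypothetical dictionary-lookup lemmatizer."""
    # Tokens missing from the table are returned unchanged.
    return [table.get(token, token) for token in tokens]


print(lemmatize_sketch(["running", "jumps"]))  # ['running', 'jump']
```

In the documented example, "running" is returned unchanged while "jumps" becomes "jump". That is consistent with a lemmatizer that treats every token as a noun, although this page does not confirm it.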

resumeanalyser.text_cleaning.clean_text(text)

Clean text by applying a series of processing steps: tokenization, converting to lower case, removing stop words, and applying lemmatization.

Parameters: text (str): A string containing the text to be cleaned.

Returns: str: The cleaned text as a single string.

Example:

    >>> clean_text("This is a sample sentence, showing off the stop words filtration.")
    'sample sentence showing stop word filtration'
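The steps listed above can be sketched as a single pipeline. This is a self-contained illustration under stated assumptions, not the package's implementation. The stop-word list and lemma table are tiny hypothetical stand-ins, and tokenization is reduced to whitespace splitting. Because the documented output contains no punctuation, the sketch also assumes punctuation is stripped first, even though the description does not list that step.

```python
import re

# Hypothetical stand-ins for the real stop-word list and lemmatizer data.
STOP_WORDS = frozenset({"this", "is", "a", "off", "the"})
LEMMAS = {"words": "word"}


def clean_text_sketch(text: str) -> str:
    """Hypothetical pipeline mirroring the documented steps of clean_text."""
    text = re.sub(r"[^\w\s]", "", text)                    # assumed: strip punctuation
    tokens = text.split()                                  # tokenize on whitespace
    tokens = [t.lower() for t in tokens]                   # convert to lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]    # remove stop words
    tokens = [LEMMAS.get(t, t) for t in tokens]            # lemmatize via lookup
    return " ".join(tokens)                                # join into one string


print(clean_text_sketch(
    "This is a sample sentence, showing off the stop words filtration."
))  # sample sentence showing stop word filtration
```

Lowercasing before stop-word removal matters in this sketch: the stop-word list is lowercase, so a leading "This" would otherwise survive the filter.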