
feat(nlp): add basic nlp transformations #40

Open · wants to merge 5 commits into base: main
8 changes: 8 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 8 additions & 0 deletions .idea/feature-fabrica.iml

7 changes: 7 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml

10 changes: 10 additions & 0 deletions .idea/misc.xml

8 changes: 8 additions & 0 deletions .idea/modules.xml

6 changes: 6 additions & 0 deletions .idea/vcs.xml
Owner commented: what are these files?

115 changes: 115 additions & 0 deletions feature_fabrica/transform/NLP.py
@@ -0,0 +1,115 @@
# NLP.py

import nltk
import numpy as np
from beartype import beartype
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
from omegaconf import ListConfig
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from feature_fabrica.transform.base import Transformation
from feature_fabrica.transform.utils import NumericArray, StrArray, StrValue

nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')


class NGrams(Transformation):
    _name_ = "NGrams"

Owner commented: lowercase

    @beartype
    def __init__(self, n: int):

Owner commented: allow to set other parameters

        super().__init__()
        self.n = n

    @beartype
    def execute(self, data: StrArray | StrValue) -> StrArray:
        if isinstance(data, str):
            return np.array(['_'.join(gram) for gram in ngrams(data.split(), self.n)])
        else:
            return np.array([' '.join(['_'.join(gram) for gram in ngrams(text.split(), self.n)]) for text in data])
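The underscore-joining logic in `execute` can be exercised without nltk. Below is a minimal sketch of the same n-gram construction for a single string; `make_ngrams` and `ngram_features` are hypothetical helpers, with `make_ngrams` standing in for `nltk.util.ngrams`:

```python
def make_ngrams(tokens, n):
    # Stand-in for nltk.util.ngrams: consecutive windows of length n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, n):
    # Mirrors the str branch of NGrams.execute: join each gram with '_'.
    return ['_'.join(gram) for gram in make_ngrams(text.split(), n)]

print(ngram_features("the quick brown fox", 2))
# → ['the_quick', 'quick_brown', 'brown_fox']
```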


class Stemming(Transformation):
    _name_ = "Stemming"

Owner commented: lowercase

    @beartype
    def __init__(self):
        super().__init__()
        self.stemmer = PorterStemmer()

Owner commented: allow to choose other stemmers

    @beartype
    def execute(self, data: StrArray | StrValue) -> StrArray | StrValue:
        if isinstance(data, str):
            return ' '.join([self.stemmer.stem(word) for word in data.split()])
        else:
            return np.array([' '.join([self.stemmer.stem(word) for word in text.split()]) for text in data])
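The reviewer's request to allow other stemmers could be met with a constructor argument resolved against a registry of stemming callables. A sketch under that assumption — `STEMMERS` and `suffix_stem` are hypothetical names, and the trivial suffix stripper merely stands in for real nltk stemmers such as `PorterStemmer().stem`:

```python
def suffix_stem(word):
    # Toy stand-in stemmer: strip one common suffix if the stem stays long enough.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical registry; real entries would be PorterStemmer().stem,
# SnowballStemmer("english").stem, and so on.
STEMMERS = {"suffix": suffix_stem}

class Stemming:
    def __init__(self, stemmer: str = "suffix"):
        self.stem = STEMMERS[stemmer]

    def execute(self, text):
        return " ".join(self.stem(word) for word in text.split())

print(Stemming().execute("running dogs jumped"))
# → runn dog jump
```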


class Lemmatization(Transformation):
    _name_ = "Lemmatization"

Owner commented: lowercase

    def __init__(self):
        super().__init__()
        self.lemmatizer = WordNetLemmatizer()

    @beartype
    def execute(self, data: StrArray | StrValue) -> StrArray | StrValue:
        if isinstance(data, str):
            return self._lemmatize_sentence(data)
        else:
            return np.array([self._lemmatize_sentence(text) for text in data])

    def _lemmatize_sentence(self, sentence: str) -> str:
        words = sentence.split()
        pos_tags = nltk.pos_tag(words)

        lemmatized_words = [
            self.lemmatizer.lemmatize(word, self._get_wordnet_pos(tag)) for word, tag in pos_tags
        ]
        return ' '.join(lemmatized_words)

    def _get_wordnet_pos(self, treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN
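The first-letter dispatch in `_get_wordnet_pos` could equally be written as a dictionary lookup. A sketch using the plain tag strings 'a', 'v', 'n', 'r' (the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, `wordnet.ADV`); `TREEBANK_TO_WORDNET` and `get_wordnet_pos` are hypothetical names:

```python
TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS by its first letter,
    # defaulting to noun, matching the if/elif chain above.
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], "n")

print(get_wordnet_pos("VBD"))  # → v
```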


class TFIDF(Transformation):
    _name_ = "TFIDF"

Owner commented: lowercase

    @beartype
    def __init__(self, max_features: int, ngram_range: tuple[int, int], stop_words: list[str] | None = None):
        super().__init__()
        self.vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range, stop_words=stop_words)

Owner commented: allow to set all parameters

    @beartype
    def execute(self, data: StrArray) -> NumericArray:
        return self.vectorizer.fit_transform(data).toarray()

Owner commented: we shouldn't apply fit_transform on test data. should be separate fit & transform
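The reviewer's fit/transform point can be illustrated with a minimal bag-of-words vectorizer (`MiniCountVectorizer` below is a hypothetical toy, not the sklearn class): `fit` learns the vocabulary from training data once, and `transform` counts against that frozen vocabulary, so unseen test tokens are dropped rather than leaking into the feature space the way a second `fit_transform` would:

```python
class MiniCountVectorizer:
    def fit(self, docs):
        # Learn the vocabulary from training data only.
        vocab = sorted({tok for doc in docs for tok in doc.split()})
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, docs):
        # Count tokens against the frozen vocabulary; unseen tokens are ignored.
        rows = []
        for doc in docs:
            counts = [0] * len(self.vocab)
            for tok in doc.split():
                if tok in self.vocab:
                    counts[self.vocab[tok]] += 1
            rows.append(counts)
        return rows

vec = MiniCountVectorizer().fit(["red blue", "blue green"])
print(vec.transform(["blue blue purple"]))
# → [[2, 0, 0]]  ('purple' was never fitted, so it contributes nothing)
```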



class BagOfWords(Transformation):
    _name_ = 'BagOfWords'

Owner commented: lowercase

    @beartype
    def __init__(self, max_features: int, ngram_range: tuple[int, int] | ListConfig):
        super().__init__()
        self.max_features = max_features

        # If ngram_range is a ListConfig, convert it to a tuple
        if isinstance(ngram_range, ListConfig):
            ngram_range = tuple(ngram_range)

        self.ngram_range = ngram_range
        self.vectorizer = CountVectorizer(
            max_features=self.max_features, ngram_range=self.ngram_range
        )

Owner commented: allow to set all parameters

    @beartype
    def execute(self, data: StrArray) -> np.ndarray:
        return self.vectorizer.fit_transform(data).toarray()

Owner commented: we shouldn't apply fit_transform on test data. should be separate fit & transform
