Dataset Link - https://www.kaggle.com/competitions/quora-question-pairs/data

Data fields

id - the id of a training set question pair
qid1, qid2 - unique ids of each question (only available in train.csv)
question1, question2 - the full text of each question
is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

cwc_min - This is the ratio of the number of common words to the length of the smaller question
cwc_max - This is the ratio of the number of common words to the length of the larger question
csc_min - This is the ratio of the number of common stop words to the smaller stop word count among the two questions
csc_max - This is the ratio of the number of common stop words to the larger stop word count among the two questions
ctc_min - This is the ratio of the number of common tokens to the smaller token count among the two questions
ctc_max - This is the ratio of the number of common tokens to the larger token count among the two questions
last_word_eq - 1 if the last word of the two questions is the same, 0 otherwise
first_word_eq - 1 if the first word of the two questions is the same, 0 otherwise
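
The token features above can be sketched roughly as follows. This is an illustration under stated assumptions, not necessarily the code used in the notebooks: questions are lowercased and split on whitespace, "words" is taken to mean tokens that are not stop words, and STOP_WORDS is a small hypothetical set standing in for the contents of stopwords.pkl. The helper names token_features and _ratio are also made up for this example.

```python
# Hypothetical stop-word list; the project's real list lives in stopwords.pkl.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in", "to", "how", "do", "i"}


def _ratio(count, denom):
    """Return count / denom, or 0.0 when the denominator is zero."""
    return count / denom if denom else 0.0


def token_features(q1, q2, stop_words=STOP_WORDS):
    """Compute the eight token-based features for one question pair."""
    t1 = q1.lower().split()  # assumption: simple whitespace tokenization
    t2 = q2.lower().split()
    if not t1 or not t2:
        # An empty question shares nothing with the other one.
        names = ["cwc_min", "cwc_max", "csc_min", "csc_max",
                 "ctc_min", "ctc_max", "last_word_eq", "first_word_eq"]
        return {name: 0.0 for name in names}

    # Assumption: "words" are the non-stop-word tokens.
    w1 = {t for t in t1 if t not in stop_words}
    w2 = {t for t in t2 if t not in stop_words}
    s1 = {t for t in t1 if t in stop_words}
    s2 = {t for t in t2 if t in stop_words}

    common_words = len(w1 & w2)
    common_stops = len(s1 & s2)
    common_tokens = len(set(t1) & set(t2))

    return {
        "cwc_min": _ratio(common_words, min(len(w1), len(w2))),
        "cwc_max": _ratio(common_words, max(len(w1), len(w2))),
        "csc_min": _ratio(common_stops, min(len(s1), len(s2))),
        "csc_max": _ratio(common_stops, max(len(s1), len(s2))),
        "ctc_min": _ratio(common_tokens, min(len(t1), len(t2))),
        "ctc_max": _ratio(common_tokens, max(len(t1), len(t2))),
        "last_word_eq": int(t1[-1] == t2[-1]),
        "first_word_eq": int(t1[0] == t2[0]),
    }


features = token_features("What is the capital of India",
                          "What is the capital city of India")
```

For this pair the two questions share both non-stop words of the shorter question, so cwc_min is 1.0, while cwc_max is 2/3 because the longer question has the extra word "city".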

mean_len - Mean of the lengths of the two questions (number of words)
abs_len_diff - Absolute difference between the lengths of the two questions (number of words)
longest_substr_ratio - Ratio of the length of the longest common substring of the two questions to the length of the smaller question
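
A minimal sketch of the length features, under a few assumptions that the descriptions above do not settle: word counts come from whitespace splitting, and the longest common substring is measured in characters and divided by the character length of the shorter question. The function name length_features is hypothetical, and difflib from the standard library is swapped in here for whatever substring routine the notebooks actually use.

```python
from difflib import SequenceMatcher


def length_features(q1, q2):
    """Compute mean_len, abs_len_diff and longest_substr_ratio for one pair."""
    n1 = len(q1.split())  # length in words (whitespace tokenization assumed)
    n2 = len(q2.split())

    # Longest common contiguous substring, measured in characters (assumption).
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))

    shorter = min(len(q1), len(q2))
    return {
        "mean_len": (n1 + n2) / 2,
        "abs_len_diff": abs(n1 - n2),
        # Guard against an empty question to avoid division by zero.
        "longest_substr_ratio": match.size / shorter if shorter else 0.0,
    }


features = length_features("how do i learn python",
                           "how do i learn python fast")
```

Here the shorter question appears verbatim inside the longer one, so longest_substr_ratio comes out as 1.0, with mean_len 5.5 and abs_len_diff 1.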
