- id - the id of a training set question pair
- qid1, qid2 - unique ids of each question (only available in train.csv)
- question1, question2 - the full text of each question
- is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.
- cwc_min - This is the ratio of the number of common words to the length of the smaller question
- cwc_max - This is the ratio of the number of common words to the length of the larger question
- csc_min - This is the ratio of the number of common stop words to the smaller stop word count among the two questions
- csc_max - This is the ratio of the number of common stop words to the larger stop word count among the two questions
- ctc_min - This is the ratio of the number of common tokens to the smaller token count among the two questions
- ctc_max - This is the ratio of the number of common tokens to the larger token count among the two questions
- last_word_eq - 1 if the last word in the two questions is same, 0 otherwise
- first_word_eq - 1 if the first word in the two questions is same, 0 otherwise
- mean_len - Mean of the length of the two questions (number of words)
- abs_len_diff - Absolute difference between the length of the two questions (number of words)
- longest_substr_ratio - Ratio of the length of the longest substring among the two questions to the length of the smaller question
- fuzz_ratio - fuzz_ratio score from fuzzywuzzy
- fuzz_partial_ratio - fuzz_partial_ratio from fuzzywuzzy
- token_sort_ratio - token_sort_ratio from fuzzywuzzy
- token_set_ratio - token_set_ratio from fuzzywuzzy