TODO merge in notes from 6.864: shingles, markov chain pos tagging, logistic, stemming, porter stemming (exclusion lists)
general NLP problems
- POS tagging
- sentence detection
- tokenization/segmentation: figuring out word/sentence boundaries
- simplest: over all possible segmentations, return the one with the max product of $P(\text{word})$ over its words (see the sketch after this list)
- chunking
- parsing
- named entity recognition
- coreference
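a minimal sketch of the max-product segmentation mentioned above (the unigram model and example string are made up; log space avoids underflow):

```python
import math
from functools import lru_cache

# toy unigram model; a real one would be estimated from a corpus
P = {"the": 0.05, "me": 0.02, "them": 0.01, "theme": 0.005}

def segment(text):
    """Over all segmentations of `text` into known words, return the one
    maximizing the product of P(word)."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return (0.0, [])
        candidates = []
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in P:
                score, rest = best(j)
                candidates.append((score + math.log(P[word]), [word] + rest))
        return max(candidates) if candidates else (float("-inf"), [])
    return best(0)[1]

print(segment("themethem"))  # ['theme', 'them'] beats ['the', 'me', 'them']
```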
misc
- $n$-grams of chars: useful for language guessing
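for illustration, a tiny char $n$-gram extractor (the name is mine); comparing a doc's $n$-gram frequency profile against per-language profiles is one simple way to guess the language:

```python
def char_ngrams(text, n=3):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("language"))  # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```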
term frequency inverse document frequency (TFIDF)
- term frequency is the num occurrences of term $t_i$ in doc $d_j$ over the total num words in $d_j$ (how prominent $t_i$ is in $d_j$): $$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
- inverse document frequency is the num docs over the num docs containing $t_i$ (more common words are less important): $$idf_i = \log \frac{|D|}{|\{ d : t_i \in d \}|}$$
- tf-idf: $tfidf_{i,j} = tf_{i,j} \cdot idf_i$
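a from-scratch sketch of the two formulas above on toy docs (real systems usually smooth the idf term so a term in every doc doesn't zero out entirely by accident of the corpus):

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-doc {term: tf-idf weight} dicts, per the tf and idf defs above."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # num docs containing t
    out = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        out.append({t: (n / total) * math.log(N / df[t])
                    for t, n in counts.items()})
    return out

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "mat"]]
print(tfidf(docs)[0])  # 'the' gets weight 0 since it appears in every doc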
soundex algorithm
- rudimentary "sounds-alike" phonetic algorithm that encodes similar-sounding words to the same representation, for matching
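a simplified sketch of classic American Soundex (it skips some edge cases, e.g. the special treatment of 'h'/'w' between same-coded consonants):

```python
def soundex(word):
    """First letter + up to three digit codes, padded/truncated to 4 chars."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # drop vowels etc., collapse repeated codes
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both 'R163'
```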
vector space model
- represent text docs as vectors of identifiers
- eg index terms: each dim corresponds to a term; some possible values are:
- bits indicating presence
- counts
- tf-idf
- relevancy rankings: treat query as doc, find most cosine-similar doc
- limitations
- long docs are poorly represented: their similarity values come out low (small scalar product in the numerator, large norm from high dimensionality in the denominator)
- order information not preserved
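a minimal sketch of the index-term representation with raw counts as values (presence bits or tf-idf weights slot in the same way; the names are mine):

```python
from collections import Counter

def to_vector(doc, vocab):
    """Represent a tokenized doc as a vector of term counts over `vocab`."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
vocab = sorted({t for d in docs for t in d})  # one dimension per term
print([to_vector(d, vocab) for d in docs])   # [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
```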
cosine similarity
- $\cos \theta = \frac{A \cdot B}{|A| |B|}$: intuitively, the angle between the vector representations of two docs
- no need to convert to an actual angle (arccos); the cosine value itself preserves the ranking, so just leave it as is
- $-1$ means opposite, $0$ means orthogonal (nothing in common), $1$ means identical; in IR, components are always non-neg, so the range is $[0,1]$
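a direct implementation of the formula, plus the treat-the-query-as-a-doc ranking idea from the vector space model section (toy count vectors):

```python
import math

def cosine(a, b):
    """A·B / (|A| |B|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

docs = {"d1": [1, 1, 0, 2], "d2": [0, 1, 1, 0]}  # toy term-count vectors
query = [1, 1, 0, 0]
print(sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True))  # ['d1', 'd2']
```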
jaccard similarity/coefficient/index
- for binary features
- simple: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
- extended jaccard coefficient aka tanimoto coefficient: $T(A,B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B}$, an extension of cosine similarity that reduces to jaccard for binary features
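both coefficients on toy inputs (the second equals the first when the vectors are binary indicators of the sets):

```python
def jaccard(a, b):
    """Jaccard index of two sets."""
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    """Extended jaccard for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

print(jaccard({"cat", "sat"}, {"cat", "mat"}))  # 1/3
print(tanimoto([1, 0, 1], [1, 1, 0]))           # also 1/3: indicator vectors
                                                # over {cat, mat, sat}
```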
NLP at Google (Katja Filippova)
- NLP problems at google: translation, speech recognition, language modeling, information extraction
- 7-gram language models on 2T+ tokens using simplified smoothing
- graph-based multi-sentence (sentence fusion) summarization
- each unique (token,pos) is a vertex; also start & end vertices
- merge graphs of multiple sentences together (merging vertices may require their neighbors to also be the same)
- edges are inverse frequencies of transitions (bigram freqs)
- find $k$ shortest paths from start to end (toy sketch at the end of this section)
- constraint: path must contain a verb
- use google news' clusters as dataset
- http://videolectures.net/russir2010_filippova_nlp/
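a toy sketch of the fusion idea, with two big simplifications: vertices are merged by lowercased token alone (the real method merges (token, pos) vertices and checks neighboring context), and only the single shortest path is returned instead of the $k$ shortest filtered by the verb constraint:

```python
import heapq
from collections import defaultdict

def fuse(sentences):
    """Merge tokenized sentences into a word graph and return the cheapest
    START->END token path; edge cost = inverse bigram frequency."""
    START, END = "<s>", "</s>"
    bigram = defaultdict(int)
    for sent in sentences:
        tokens = [START] + [t.lower() for t in sent] + [END]
        for a, b in zip(tokens, tokens[1:]):
            bigram[(a, b)] += 1                  # vertices merged by token
    edges = defaultdict(list)
    for (a, b), f in bigram.items():
        edges[a].append((b, 1.0 / f))            # frequent transitions are cheap
    heap, seen = [(0.0, START, [START])], set()  # Dijkstra; weights non-negative
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == END:
            return path[1:-1]
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in edges[node]:
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return []

print(fuse([["The", "cat", "sat", "on", "the", "mat"],
            ["The", "cat", "lay", "on", "the", "rug"]]))
# -> ['the', 'mat']: degenerate, which is why the real method needs constraints
```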