
Corpus × corpus distances #47

Open
Witiko opened this issue Feb 2, 2019 · 3 comments

Comments


Witiko commented Feb 2, 2019

Have you considered speed improvements that might result from computing distances between two corpora (queries and a document collection)? With cosine similarity, this is simply a dot product between two term-document matrices. With network flows, perhaps one large network could be constructed, where the same word in different documents would correspond to distinct nodes? Running a for loop over nearest_neighbors is not very fast even with the heuristics you implemented.
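For the cosine case, here is a minimal sketch of what I mean (plain numpy, document × term matrices, random data; nothing here is wmd-relax API):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.random((100, 5000))    # queries × vocabulary, e.g. tf-idf weights
C = rng.random((2000, 5000))   # collection documents × vocabulary

# L2-normalize the rows so that a plain dot product yields cosine similarities.
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
similarities = Qn @ Cn.T       # |Q| × |D| matrix in one matrix product, no Python loop
```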

vmarkovtsev (Collaborator) commented

Thanks for an interesting idea! There can be a problem with building a big network, though: the EMD calculation chokes at a problem size of 1,000 and becomes impractically slow at 2,000; the cubic complexity is to blame. On the other hand, there are tricks to reduce the problem size, and we can also subsample by significance. Are you aware of any papers about this?


Witiko commented Feb 2, 2019

I was thinking of computing the subset graph of all document sets (i.e., bags of words with ones and zeros) in O(N² / log N) (see Yellin and Jutla, 1993), where N is the sum of set cardinalities, i.e. N = |D| · |V|, where |D| is the number of documents and |V| is the size of the dictionary. By Heaps' law, |V| ≈ √|D|, so the time complexity is O(|D|³ / log |D|). However, this is asymptotically slower than the O(|D|² |V| log |V|) ≈ O(|D|^2.5 / log |D|) time currently required to compute all pairwise distances (not to mention that the number of words two documents have in common will be less than |V|), so this still needs more thinking through.
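For clarity, the substitution behind the O(|D|³ / log |D|) figure:

```latex
N = |D| \cdot |V| \approx |D| \cdot \sqrt{|D|} = |D|^{3/2}
\quad\Longrightarrow\quad
O\!\left(\frac{N^2}{\log N}\right)
= O\!\left(\frac{|D|^{3}}{\tfrac{3}{2}\log|D|}\right)
= O\!\left(\frac{|D|^{3}}{\log|D|}\right)
```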

If we had the subset graph, then for every nearest_neighbors call, we could filter out strict subsets of strict subsets of the query, unless that leaves us with fewer than k documents. This is correct because a missing word cannot decrease the word mover's distance if the cost is a metric. I do not have an estimate of how many documents this filters out, but intuitively, it would be a large portion of D.
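A rough sketch of that filtering step, assuming the strict-subset relations are already available as boolean matrices (the names and shapes below are mine, not anything that exists in wmd-relax):

```python
import numpy as np

def filter_candidates(strict_qc, strict_cc, query_index, k):
    """strict_qc[q, d]: document d is a strict subset of query q;
    strict_cc[c, d]: document d is a strict subset of document c."""
    # Document d can be dropped iff some c is a strict subset of the query
    # and d is in turn a strict subset of c.
    chains = strict_qc[query_index].astype(int) @ strict_cc.astype(int)
    keep = np.flatnonzero(chains == 0)
    if len(keep) < k:                     # do not over-filter below the requested top k
        return np.arange(strict_cc.shape[1])
    return keep
```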


Witiko commented Feb 8, 2019

Although an explicit subset graph construction seems infeasible, there is a linear-algebra procedure that allows us to filter out strict subsets of strict subsets of the query (see the second paragraph of the post above):

Take sign(A)ᵀ · sign(B), where A and B are term-document matrices (terms as rows, documents as columns), so that cell (i, j) of the product counts the terms shared by document i of A and document j of B. Divide each column of the result element-wise by the column sums of sign(A) (the number of distinct terms in each document of A), assign 1 to cells where a division by zero (0 / 0) takes place, floor the resulting matrix, and call the result M. Take sign(A)ᵀ · sign(B) again, divide each row of the result element-wise by the column sums of sign(B) (the number of distinct terms in each document of B), assign 1 to cells where a division by zero (0 / 0) takes place, floor the resulting matrix, and call the result N. Then document j of B is a subset of document i of A iff Nᵢⱼ = 1, and a strict subset iff Mᵢⱼ = 0 and Nᵢⱼ = 1.
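In numpy terms this would be roughly the following (my own sketch; A and B are dense binary term × document arrays):

```python
import numpy as np

def subset_indicators(A, B):
    """M[i, j] = 1 iff document i of A is a subset of document j of B;
    N[i, j] = 1 iff document j of B is a subset of document i of A."""
    A, B = np.sign(A), np.sign(B)
    overlap = A.T @ B                    # shared-term counts, |D_A| × |D_B|
    terms_a = A.sum(axis=0)              # distinct terms per document of A
    terms_b = B.sum(axis=0)              # distinct terms per document of B
    with np.errstate(divide="ignore", invalid="ignore"):
        M = np.floor(overlap / terms_a[:, None])
        N = np.floor(overlap / terms_b[None, :])
    M[~np.isfinite(M)] = 1               # 0 / 0: an empty document is a subset of anything
    N[~np.isfinite(N)] = 1
    return M, N
```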

Let Q and C be the term-document matrices for the queries and for the collection documents, respectively. Computing M and N twice, once for A=Q and B=C and again for A=C and B=C, gives the worst-case time complexity O(|D|² |V|) ≈ O(|D|^2.5). In theory, this is still not an improvement over O(|D|² |V| log |V|) ≈ O(|D|^2.5 / log |D|), although an optimized implementation of the matrix dot product may well make this worthwhile.
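Continuing the sketch above, the two passes would look something like this (illustrative shapes and random data only):

```python
rng = np.random.default_rng(0)
Q = (rng.random((5000, 100)) < 0.01).astype(int)     # terms × queries
C = (rng.random((5000, 2000)) < 0.01).astype(int)    # terms × collection documents

M_qc, N_qc = subset_indicators(Q, C)                 # A = Q, B = C
strict_qc = (N_qc == 1) & (M_qc == 0)                # C[:, j] is a strict subset of Q[:, i]

M_cc, N_cc = subset_indicators(C, C)                 # A = C, B = C
strict_cc = (N_cc == 1) & (M_cc == 0)                # C[:, j] is a strict subset of C[:, i]
```

These strict_qc / strict_cc masks are what the filtering sketch in my earlier comment would consume.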

I would like to benchmark this in the future.
