-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sequential lda optimization #3172
base: develop
Are you sure you want to change the base?
Sequential lda optimization #3172
Conversation
Reduced time for sslm initialization (it was especially critical for large datasets). Removed duplicated code.
5768e96
to
2015bdb
Compare
I've run a test using setup from ldaseqmodel example notebook in gensim docs. As for now, the current time distribution is present in the picture below. Almost all time is spent in scipy Python optimization code. The time which is spent in Cython marked with red rectangles. The time in rectangles (right up corner) is in nanoseconds. The test was done using Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz The raw output of a profiler + file for kcachegrind is on gdrive |
Thanks! Do you have any high-level numbers, on a larger dataset? To get a sense of the scale of these optimizations, now-vs-before. |
I trained both implementations using the first 10 years of the UN General Debate dataset . The code I used:
The results: Old Implementation : ~47 hours |
Thanks for the timings. That's a massive speedup! And are the new-vs-old results comparable, quality-wise? |
Sure... I accidentally rewrote a trained old model :( So for now I just trained with two years of data using the same code + removed stopwords using nltk set. I've put the resulting topics in the excel table . As for me topics are very similar. Regarding the model trained with 10 years of data, let me train it one more time :) I'll back in a couple of days. Meanwhile, I'll find a more formal way to compare two SeqLda models. |
I've trained the model one more time. Pickled models are available on gdrive. I calculated the coherence score for both models as it was done in the docs tutorial. The code I used:
Results:
|
Thanks so much. This is helpful. How can we use the new version? Is it built in the package, or we need to revise any of the codes ourselves? |
Good point. @Sharganov can you fix the bug you found above & tune the memory usage? Let's get your work merged & released! |
@Sharganov Ping |
Thats some impressive work @Sharganov. Any news on whether this is anywhere close to being merged/released? |
Hi everyone, sorry for a long break. I fixed the bug which I mentioned. Also I'll look at memory usage. |
@Sharganov what did you find about the memory? We're planning a new release, and this looks a solid candidate for including if finished up. |
@piskvorky I am planning to look at memory on aprox. 28/02 - 04/03, I'll have a lot of free time on these dates. Hope it fits your plans for release. |
No problem, thanks. If we don't manage to finish the optimization in this release, we can include it in the next. |
This is great. Looking forward to the release! Can this be used in its current state? |
Really looking forward to the faster release, @piskvorky @Sharganov ! Great job! |
…tial_lda_optimization
Hi everyone, any updates here? |
@Sharganov how did your memory optimization go? Looks like there is quite some demand for this feature! |
What's the status? Do you need any help/volunteer to bridge gaps against changes in the code base? |
Please, any updates so far? I really need to enhance the efficiency of the LdaSeqmodel. If we set aside concerns about memory optimization for the time being, would it be possible for me to make local file modifications to load this version? Or does this version have any errors or conflicts with the current one? Thank you for your assistance. |
I think it's fair to say this PR is stale / dead. Unless someone picks it up to push it over the line, it's not happening, sorry. |
So pity. I try to make local file modifications to load this version. I am facing the same problem detailed in #3491 when trying to compile gensim-4.3.2.tar.gz from the source. I've followed the recommended steps, including ensuring all prerequisites are met, and using the python setup.py build_ext --inplace command, but I still encounter compilation errors. Any guidance or suggestions based on that issue would be greatly appreciated. |
Hi team,
This pull request addresses the issue of the slow computational time of LdaSeqModel #1545 . In the current stage, it contains cythonized methods of SSLM and LdaPost classes. It is not the final version, I plan to add the following changes:
sslm_counts_init
function of sslm classchain.sslm_counts_init(topic_obs_variance, topic_chain_variance, sstats)
inLdaSeqModel
class, here class method is used, however I don't know why) + many othersupdate_obs
function ofsslm
class. You can compare the logic in the gensim implementation and Blei's lab implementationBy now the performance improvement is 3-5 times on toy datasets e.g. https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_ldaseqmodel.py. I still haven't tried to benchmark for larger datasets as it takes a while to train original implementation, however, the speed up seems to be much higher based on synthetic data I tried.
Extracted topics of optimized model differ a bit from the original one (gensim tests are passed). The exact places in the code which causes it are
update_zeta
function ofsslm
classnp.negative
at the end ofdf_obs
Everything except these two places works absolutely the same. From my perspective, it is not a bug in the code, just issues with the precision. (I hope so)
@gojomo @mpenkov @piskvorky What do you think about the idea of such changes in gensim in general, code structure and code itself particularly? I plan to finish with the optimization and bug fixes I mentioned above by the end of this month.
P.S. I've created the pull request before ending to be sure that these changes are valuable for the project :)