topic modeling #122

Open
herrtao opened this issue Nov 23, 2015 · 5 comments

@herrtao
herrtao commented Nov 23, 2015

Can I use Tethne to do topic modeling on my own plain-text files (about 700 separate .txt files)?

@erickpeirson
Collaborator

@herrtao Wow, somehow this completely slipped by me -- apologies for not responding.

Tethne is primarily designed for cases where you are starting with bibliographic metadata (e.g. from Web of Science, JSTOR, Zotero). If you're just working with a bunch of plain-text files, then there are potentially simpler approaches.

As a starting point, you might take a look at the notebooks in this project. There are several different workflows; the topic modeling sections include notebooks that demonstrate LDA with Tethne/MALLET and with gensim. In particular, this notebook demonstrates LDA with gensim. If you don't have metadata, you can just skip or comment out those parts.
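
A minimal sketch of the gensim route for a directory of plain-text files, assuming gensim is installed. The helper names (`load_texts`, `tokenize`, `fit_lda`), the tokenization rule, and the parameter values are illustrative assumptions and are not taken from the notebooks mentioned above:

```python
import re
from pathlib import Path


def load_texts(directory, pattern="*.txt"):
    """Return {filename: raw text} for every file in `directory` matching `pattern`."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="ignore")
        for path in sorted(Path(directory).glob(pattern))
    }


def tokenize(text, min_length=3):
    """Lowercase the text and keep alphabetic tokens with at least `min_length` letters."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_length]


def fit_lda(token_lists, num_topics=5, passes=10):
    """Fit a gensim LDA model on a list of token lists (one list per document)."""
    # Imported here so the helpers above still work without gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(token_lists)
    bow = [dictionary.doc2bow(tokens) for tokens in token_lists]
    return models.LdaModel(bow, id2word=dictionary, num_topics=num_topics, passes=passes)


# Example use (the path is a placeholder):
# texts = load_texts("/path/to/directory/with/texts")
# lda = fit_lda([tokenize(t) for t in texts.values()])
# print(lda.print_topics())
```

A real corpus of around 700 files would probably also want stop-word removal and some tuning of `num_topics`; both are left out here to keep the sketch short.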

I hope that helps! Let me know if you have any other questions. We can also discuss further off-channel if you'd prefer (erick.peirson@asu.edu).

@erickpeirson
Collaborator

This will be TETHNE-131

@erickpeirson
Collaborator

@herrtao Take a look at this thread for a related discussion. It's not exactly what you asked, but it may be helpful.

@erickpeirson
Collaborator

erickpeirson commented Jul 13, 2016

@herrtao Ok, as of v0.8.1.dev5 this is now a feature! Since this is a pre-release version you'll have to upgrade Tethne with the --pre flag.

pip install -U tethne --pre

Here's an example. Please let me know what you think. If you run into issues, or have other requests, please check out our new Q/A group.

>>> from tethne.readers.plain_text import read
>>> corpus = read('/path/to/directory/with/texts')

To use the corpus for topic modeling, you could then do:

>>> model = LDAModel(corpus, featureset_name='plain_text')
>>> model.fit(Z=5, max_iter=200)

More documentation will be forthcoming, but here's the docstring for now:

Generate a :class:`.Corpus` from a collection of plain-text files.

Plain-text content will be available as a feature set called "plain_text".

Uses :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader`\.

Parameters
----------
path : str
    Path to a directory containing plain text files.
pattern : str
    (default: '.+\.txt') A RegEx pattern used to select texts for inclusion
    in the corpus. By default will select any file ending in `.txt`.
extractor : function
    This function can be used to parse the name of each file for additional
    metadata. It should accept a single string (the filename), and return
    a dictionary of fields and values. These fields will be added to the
    resulting :class:`.Paper` instance.
index_by : str
    (default: 'fileid') Field on :class:`.Paper` to use as the primary
    index.
structured : bool
    (default: True) If True, the contents of the document collection will be
    represented by a :class:`.StructuredFeatureSet`\. If False, a
    :class:`.FeatureSet` will be used instead. Setting ``structured=False``
    is appropriate if word-order does not matter (e.g. topic modeling).
corpus : bool
    (default: True) If False, will return a list of :class:`.Paper`
    instances rather than a :class:`.Corpus`\.
kwargs : kwargs
    Any additional kwargs will be passed to the
    :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader` constructor.
    Refer to the `NLTK documentation
    <http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader>`_
    for details.

Returns
-------
:class:`.Corpus`
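
As a rough illustration of the `extractor` parameter described in the docstring, here is a sketch of a function that takes a filename and returns a dictionary of fields. The naming scheme (`<author>_<year>.txt`), the function name `parse_filename`, and the field names `author` and `date` are hypothetical and chosen only for this example:

```python
import re

# Hypothetical naming scheme: "<author>_<year>.txt", e.g. "Smith_2004.txt".
FILENAME_PATTERN = re.compile(r"(?P<author>[^_/\\]+)_(?P<date>\d{4})\.txt$")


def parse_filename(filename):
    """Return metadata parsed from a filename, or an empty dict if it does not match."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return {}
    return {"author": match.group("author"), "date": int(match.group("date"))}
```

Going by the docstring, this would be passed as `read('/path/to/directory/with/texts', extractor=parse_filename)`, optionally with `structured=False` when the corpus is only used for topic modeling.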

@herrtao
Author

herrtao commented Mar 7, 2017

thanks for the reply!
