topic modeling #122

herrtao · 2015-11-23T03:36:21Z

Can I use Tethne to do topic modeling on my own .txt files (about 700 separate files)?

erickpeirson · 2016-06-20T21:09:05Z

@herrtao Wow, somehow this completely slipped by me -- apologies for not responding.

Tethne is primarily designed for cases where you are starting with bibliographic metadata (e.g. from Web of Science, JSTOR, Zotero). If you're just working with a bunch of plain-text files, then there are potentially simpler approaches.

As a starting point, you might take a look at the notebooks in this project. There are several different workflows: the topic modeling sections include notebooks that demonstrate LDA with Tethne/MALLET and with gensim. In particular, this notebook demonstrates LDA with gensim. If you don't have metadata, you can just skip or comment out those parts.
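If you only have a directory of plain-text files, the preparation step can be sketched with the standard library alone. This is a minimal sketch, not the workflow from the notebooks: the names `tokenize` and `load_bags_of_words` are hypothetical, the tokenizer is deliberately crude (lowercase alphabetic runs only), and it assumes the files are UTF-8 and sit directly in one directory.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def load_bags_of_words(directory, suffix=".txt"):
    """Map each matching file name to a Counter of its token frequencies.

    Hypothetical helper for illustration; only files directly inside
    `directory` whose extension equals `suffix` are read.
    """
    bags = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix == suffix:
            # Undecodable bytes are replaced rather than raising an error.
            text = path.read_text(encoding="utf-8", errors="replace")
            bags[path.name] = Counter(tokenize(text))
    return bags
```

Calling `load_bags_of_words('/path/to/directory/with/texts')` gives one word-count mapping per file. Counts of this kind are the usual starting input for LDA libraries, although each library expects its own container format, so some conversion will still be needed.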

I hope that helps! Let me know if you have any other questions. We can also discuss further off-channel if you'd prefer (erick.peirson@asu.edu).

erickpeirson · 2016-07-08T18:37:20Z

This will be TETHNE-131

erickpeirson · 2016-07-12T20:47:25Z

@herrtao Take a look at this thread for a related discussion. It's not exactly what you asked, but maybe helpful.

erickpeirson · 2016-07-13T21:41:38Z

@herrtao Ok, as of v0.8.1.dev5 this is now a feature! Since this is a pre-release version you'll have to upgrade Tethne with the --pre flag.

pip install -U tethne --pre

Here's an example. Please let me know what you think. If you run into issues, or have other requests, please check out our new Q/A group.

>>> from tethne.readers.plain_text import read
>>> corpus = read('/path/to/directory/with/texts')

To use the corpus for topic modeling, you could then do:

>>> from tethne import LDAModel   # import path assumed; it was not shown in the original post
>>> model = LDAModel(corpus, featureset_name='plain_text')
>>> model.fit(Z=5, max_iter=200)

More documentation will be forthcoming, but here's the docstring for now:

Generate a :class:`.Corpus` from a collection of plain-text files.

Plain-text content will be available as a feature set called "plain_text".

Uses :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader`\.

Parameters
----------
path : str
    Path to a directory containing plain text files.
pattern : str
    (default: '.+\.txt') A RegEx pattern used to select texts for inclusion
    in the corpus. By default will select any file ending in `.txt`.
extractor : function
    This function can be used to parse the name of each file for additional
    metadata. It should accept a single string (the filename), and return
    a dictionary of fields and values. These fields will be added to the
    resulting :class:`.Paper` instance.
index_by : str
    (default: 'fileid') Field on :class:`.Paper` to use as the primary
    index.
structured : bool
    (default: True) If True, the contents of the document collection will be
    represented by a :class:`.StructuredFeatureSet`\. If False, a
    :class:`.FeatureSet` will be used instead. Setting ``structured=False``
    is appropriate if word-order does not matter (e.g. topic modeling).
corpus : bool
    (default: True) If False, will return a list of :class:`.Paper`
    instances rather than a :class:`.Corpus`\.
kwargs : kwargs
    Any additional kwargs will be passed to the
    :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader` constructor.
    Refer to the `NLTK documentation
    <http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader>`_
    for details.

Returns
-------
:class:`.Corpus`

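To make the `pattern` and `extractor` parameters more concrete, here is a small sketch. It assumes a hypothetical file naming convention of `<author>_<year>.txt`, and the field names `author` and `date` are illustrative only; the docstring does not say which fields a :class:`.Paper` expects. The `is_selected` helper merely approximates the file selection with a full regex match and may differ in detail from what NLTK does internally.

```python
import re

# Default selection pattern quoted in the docstring above.
DEFAULT_PATTERN = r'.+\.txt'

# Hypothetical naming convention for this sketch: <author>_<year>.txt
FILENAME_RE = re.compile(r'(?P<author>[^_/]+)_(?P<year>\d{4})\.txt$')


def is_selected(filename, pattern=DEFAULT_PATTERN):
    """Approximate file selection: the whole name must match the pattern."""
    return re.fullmatch(pattern, filename) is not None


def extract_metadata(filename):
    """Accept a filename and return a dict of fields and values.

    Returns an empty dict when the name does not follow the assumed
    <author>_<year>.txt convention.
    """
    match = FILENAME_RE.search(filename)
    if match is None:
        return {}
    return {'author': match.group('author'), 'date': int(match.group('year'))}
```

Going by the docstring, a function like this would be passed as `read('/path/to/directory/with/texts', extractor=extract_metadata)`, and the returned fields would be added to each resulting :class:`.Paper`.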

herrtao · 2017-03-07T14:20:32Z

thanks for the reply!

erickpeirson added the question label Jun 20, 2016

erickpeirson added enhancement documentation and removed question labels Jul 8, 2016

This was referenced Jul 11, 2016

Using Mallet file as corpus for the LDA model #149

Closed

TETHNE-133 can load existing MALLET output into LDAModel #158

Merged

erickpeirson mentioned this issue Jul 13, 2016

TETHNE-131 added a plain-text corpus reader #160

Merged
