Using available HackerNews dataset produce some insight into the most meaningful topics.
- Most discussed topics (and yearly shift)
- Top technology and startups
- Java
- Deeplearning4J
- Word2vec
- Stanford CoreNLP (lemmatizing)
- Gather data (DONE)
- Produce JSON (DONE)
- For selected terms - related words trending through years 2007 - 2017 (DONE)
- All terms - display counts every year (TODO)
- Term cleanup (DONE)
- Auto build SPA, i.e. push -> CI -> deploy (TODO)
- Fine tune word2vec params (see below)
Project is hosted on GitHub
Help is welcome in fine-tuning word2vec parameters. Here is current setup:
Word2Vec vec = new Word2Vec.Builder()
.minWordFrequency(5)
.iterations(1)
.layerSize(100)
.seed(System.currentTimeMillis())
.windowSize(5)
.iterate(iter)
.tokenizerFactory(t)
.build();