oborchers · Feb 13, 2020
diff --git a/.gitignore b/.gitignore
@@ -46,6 +46,7 @@ Thumbs.db
 legacy
 latex
 draft
+drafts
 fse.egg-info/
 
 # Other #

diff --git a/.gitignore.save b/.gitignore.save
@@ -0,0 +1,61 @@
+# Compiled source #
+###################
+*.com
+*.class
+*.dll
+*.exe
+*.o
+*.so
+*.pyc
+
+# Packages #
+############
+# it's better to unpack these files and commit the raw source
+# git has its own built in compression methods
+*.7z
+*.dmg
+*.gz
+*.iso
+*.jar
+*.rar
+*.tar
+*.zip
+
+# Logs and databases #
+######################
+*.log
+*.sql
+*.sqlite
+*.pkl
+*.bak
+*.npy
+*.npz
+*.code-workspace
+
+# OS generated files #
+######################
+.DS_Store?
+.DS_Store
+ehthumbs.db
+Icon?
+Thumbs.db
+*.icloud
+
+# Folders #
+###########
+legacy
+
+# Other #
+#########
+.ipynb_checkpoints/
+.settings/
+.vscode/
+.eggs
+.coverage
+*.bak
+/build/
+/dist/
+*.prof
+*.lprof
+*.bin
+*.old
diff --git a/.lgtm.yml b/.lgtm.yml
@@ -1,6 +1,6 @@
 path_classifiers:
   test:
-      - exclude: "**/test_*"
+      - test
 
 extraction:
   python:

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -4,4 +4,6 @@ include README.md
 include fse/models/voidptr.h
 
 include fse/models/average_inner.pyx
-include fse/models/average_inner.pxd
+include fse/models/average_inner.pxd
+
+include fse/models/pooling_inner.pyx
diff --git a/README.md b/README.md
@@ -2,33 +2,33 @@
 <a href="https://travis-ci.com/oborchers/Fast_Sentence_Embeddings"><img alt="Build Status" src="https://travis-ci.com/oborchers/Fast_Sentence_Embeddings.svg?branch=master"></a>
 <a href="https://coveralls.io/github/oborchers/Fast_Sentence_Embeddings?branch=master"><img alt="Coverage Status" src="https://coveralls.io/repos/github/oborchers/Fast_Sentence_Embeddings/badge.svg?branch=master"></a>
 <a href="https://pepy.tech/project/fse"><img alt="Downloads" src="https://pepy.tech/badge/fse"></a>
+<a href="https://lgtm.com/projects/g/oborchers/Fast_Sentence_Embeddings/context:python"><img alt="Language grade: Python" src="https://img.shields.io/lgtm/grade/python/g/oborchers/Fast_Sentence_Embeddings.svg"></a>
 <a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
+<a href="https://img.shields.io/github/license/oborchers/Fast_Sentence_Embeddings.svg?style=flat"><img alt="License: GPL3" src="https://img.shields.io/github/license/oborchers/Fast_Sentence_Embeddings.svg?style=flat"></a>
 </p>
 
 Fast Sentence Embeddings (fse)
 ==================================
 
 Fast Sentence Embeddings is a Python library that serves as an addition to Gensim. This library is intended to compute *sentence vectors* for large collections of sentences or documents. 
 
+If you want to support fse, take a quick [survey](https://forms.gle/8uSU323fWUVtVwcAA) to improve it :-)
+
+*For additional features, please check out the dev branch. There is no development on the master branch*
 
 Features
 ------------
 
-Find the corresponding blog post(s) here:
-
-- [Visualizing 100,000 Amazon Products](https://towardsdatascience.com/vis-amz-83dea6fcb059)
-- [Sentence Embeddings. Fast, please!](https://towardsdatascience.com/fse-2b1ffa791cf9)
-
-- **Announcment: Please understand, that I am at the end of my PhD and I do not have  many free minutes to fix issues or add features.**
-
-**fse** implements three algorithms for sentence embeddings. You can choose
-between *unweighted sentence averages*,  *smooth inverse frequency averages*, and *unsupervised smooth inverse frequency averages*. 
+**fse** implements multiple algorithms for sentence embeddings. You can choose
+between *unweighted sentence averages*, *smooth inverse frequency averages*, *unsupervised smooth inverse frequency averages*, and *max pooling*. All models support hierarchical estimation, similar to convolutional filters in CNNs.
 
 Key features of **fse** are: 
 
 **[X]** Up to 500.000 sentences / second (1)
 
-**[X]** Supports Average, SIF, and uSIF Embeddings
+**[X]** Supports Average, SIF, uSIF, and MaxPooling Embeddings
+
+**[X]** All models can be estimated as hierarchical models (with window size and stride)
 
 **[X]** Full support for Gensims Word2Vec and all other compatible classes
 
@@ -50,20 +50,18 @@ Key features of **fse** are:
 
 **[X]** Extensive documentation of all functions
 
+**[X]** Extensive unit tests for Linux/OSX
+
 **[X]** Optimized Input Classes
 
 (1) May vary significantly from system to system (i.e. by using swap memory) and processing.
 I regularly observe 300k-500k sentences/s for preprocessed data on my Macbook (2016).
 Visit **Tutorial.ipynb** for an example.
 
-Things I will work on next:
-
-**[ ]** MaxPooling / Hierarchical Pooling Embedding
-
-**[ ]** Approximate Nearest Neighbor Search for SentenceVectors
-
-
+Find the corresponding blog post(s) here:
 
+- [Visualizing 100,000 Amazon Products](https://towardsdatascience.com/vis-amz-83dea6fcb059) (Note: The code may be outdated)
+- [Sentence Embeddings. Fast, please!](https://towardsdatascience.com/fse-2b1ffa791cf9) (Note: The code may be outdated)
 
 Installation
 ------------
@@ -85,9 +83,30 @@ If building the Cython extension fails (you will be notified), try:
 
     pip install -U git+https://github.com/oborchers/Fast_Sentence_Embeddings
 
+To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi 
+
 Usage
 -------------
 
+In order to use **fse** you must first estimate a Gensim model which contains a
+gensim.models.keyedvectors.BaseKeyedVectors class, for example 
+*Word2Vec* or *Fasttext*. Then you can proceed to compute sentence embeddings
+for a corpus.
+
+	from gensim.models import FastText
+	sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
+	ft = FastText(sentences, min_count=1, size=10)
+
+	from fse.models import Average
+	from fse import IndexedList
+	model = Average(ft)
+	model.train(IndexedList(sentences))
+
+	model.sv.similarity(0,1)
+
+fse offers multi-thread support out of the box. However, for most
+applications a **single** thread will most likely be sufficient.
+
 Within the folder notebooks you can find the following guides:
 
 **Tutorial.ipynb** offers a detailed walk-through of some of the most important functions fse has to offer.
@@ -107,34 +126,14 @@ The models presented are based on
 - Deep-averaging embeddings [1]
 - Smooth inverse frequency embeddings [2]
 - Unsupervised smooth inverse frequency embeddings [3]
+- MaxPooling / Hierarchical MaxPooling [5]
 
-Credits to Radim Řehůřek and all contributors for the **awesome** library
-and code that [Gensim](https://github.com/RaRe-Technologies/gensim) provides. A whole lot of the code found in this lib is based on Gensim.
-
-In order to use **fse** you must first estimate a Gensim model which contains a
-gensim.models.keyedvectors.BaseKeyedVectors class, for example 
-*Word2Vec* or *Fasttext*. Then you can proceed to compute sentence embeddings
-for a corpus.
-
-	from gensim.models import FastText
-	sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
-	ft = FastText(sentences, min_count=1, size=10)
-
-	from fse.models import Average
-	from fse import IndexedList
-	model = Average(ft)
-	model.train(IndexedList(sentences))
-
-	model.sv.similarity(0,1)
-
-fse offers multi-thread support out of the box. However, for most
-applications a *single thread will most likely be sufficient*.
-
-To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi 
 
 Results
 ------------
 
+Note: Although some models perform very well on the sentence similarity task (STS), this does not imply good performance in other downstream tasks!
+
 Model | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
 :---: | :---:
 `CBOW-Paranmt` | **79.85**
@@ -156,6 +155,19 @@ Model | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Re
 Changelog
 -------------
 
+0.1.16 from 0.1.15:
+- Added Hierarchical (Convolutional) Embeddings for all Models
+- Added MaxPooling
+- Added Features to Sentencevectors
+- Added further unittests
+- Workaround for Numpy memmap issue (https://github.com/numpy/numpy/issues/13172)
+- Bugfixes for python 3.8 builds
+- Code refactoring to black style
+- SVD ram subsampling for SIF / uSIF (customizable, standard is 1 GB of RAM)
+- Minor fixes for nan-handling
+- Minor fixes for sentencevectors class
+- Changed License
+
 0.1.15 from 0.1.11:
 - Fixed major FT Ngram computation bug
 - Rewrote the input class. Turns out NamedTuple was pretty slow. 
@@ -181,13 +193,16 @@ Proceedings of the 3rd Workshop on Representation Learning for NLP. (Toulon, Fra
 
 4. Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia Specia. Semeval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. Proceedings of SemEval 2017.
 
+5. Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, Lawrence Carin (2018) Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. ACL 2018.
+
 
 Copyright
 -------------
 
-Author: Oliver Borchers <borchers@bwl.uni-mannheim.de>
+Author: Oliver Borchers <oliver-borchers@outlook.de>
+
+Copyright (C) 2020 Oliver Borchers
 
-Copyright (C) 2019 Oliver Borchers
 
 Citation
 -------------
@@ -197,7 +212,7 @@ If you found this software useful, please cite it in your publication.
 	@misc{Borchers2019,
 		author = {Borchers, Oliver},
 		title = {Fast sentence embeddings},
-		year = {2019},
+		year = {2020},
 		publisher = {GitHub},
 		journal = {GitHub Repository},
 		howpublished = {\url{https://github.com/oborchers/Fast_Sentence_Embeddings}},
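
Note on the README changes above: the new feature text describes max pooling and hierarchical estimation "with window size and stride". The sketch below illustrates that idea in plain NumPy, following the hierarchical pooling described in reference [5]: average each local window of word vectors, then take the element-wise maximum over the window averages. It is not fse's implementation. The function names, the `window` and `stride` parameters, and the fallback for sentences shorter than the window are assumptions made for illustration only.

```python
import numpy as np


def max_pool(vectors):
    """Element-wise maximum over all word vectors of one sentence."""
    return np.asarray(vectors, dtype=np.float32).max(axis=0)


def hierarchical_max_pool(vectors, window=2, stride=1):
    """Average each sliding window, then max-pool over the window averages.

    Illustrative only: names and defaults are hypothetical, not fse's API.
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    n = len(vectors)
    if n == 0:
        raise ValueError("sentence has no word vectors")
    if n < window:
        # Assumption: fall back to a plain average for short sentences.
        return vectors.mean(axis=0)
    # Start positions of every full window, stepping by `stride`.
    starts = range(0, n - window + 1, stride)
    window_means = np.stack([vectors[s:s + window].mean(axis=0) for s in starts])
    return window_means.max(axis=0)


# Three 2-dimensional "word vectors" for a single sentence.
sentence = [[1.0, 0.0], [3.0, 4.0], [5.0, -2.0]]
plain = max_pool(sentence)
hier = hierarchical_max_pool(sentence, window=2, stride=1)
```

With `window=2` and `stride=1` the two windows average to `[2, 2]` and `[4, 1]`, so the hierarchical result is `[4, 2]`, while plain max pooling yields `[5, 4]`. Averaging inside each window keeps some local word-order information that a global maximum discards.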

diff --git a/fse/inputs.py b/fse/inputs.py
@@ -1,8 +1,8 @@
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 
-# Author: Oliver Borchers <borchers@bwl.uni-mannheim.de>
-# Copyright (C) 2019 Oliver Borchers
+# Author: Oliver Borchers <oliver-borchers@outlook.de>
+# Copyright (C) 2020 Oliver Borchers
 
 from typing import MutableSequence
 
@@ -150,6 +150,9 @@ def __init__(self, *args:[list, set, ndarray], custom_index:[list, ndarray]):
         """
         self.custom_index = custom_index
 
+        if len(args) > 1:
+            raise RuntimeError("Argument merging not supported")
+
         super(CIndexedList, self).__init__(*args)
 
         if len(self.items) != len(self.custom_index):
@@ -176,9 +179,6 @@ def insert(self, i:int, item:str):
 
     def append(self, item:str):
         raise NotImplementedError("Method currently not supported")
-
-    def extend(self, arg:[list, set, ndarray]):
-        raise NotImplementedError("Method currently not supported")
 
 class SplitIndexedList(BaseIndexedList):
 
@@ -220,6 +220,9 @@ def __init__(self, *args:[list, set, ndarray], custom_index:[list, ndarray]):
         """
         self.custom_index = custom_index
 
+        if len(args) > 1:
+            raise RuntimeError("Argument merging not supported")
+
         super(SplitCIndexedList, self).__init__(*args)
 
         if len(self.items) != len(self.custom_index):
@@ -248,9 +251,6 @@ def insert(self, i:int, item:str):
     def append(self, item:str):
         raise NotImplementedError("Method currently not supported")
 
-    def extend(self, arg:[list, set, ndarray]):
-        raise NotImplementedError("Method currently not supported")
-
 class CSplitIndexedList(BaseIndexedList):
 
     def __init__(self, *args:[list, set, ndarray], custom_split:callable):
@@ -296,6 +296,9 @@ def __init__(self, *args:[list, set, ndarray], custom_split:callable, custom_ind
         """
         self.custom_split = custom_split
         self.custom_index = custom_index
+
+        if len(args) > 1:
+            raise RuntimeError("Argument merging not supported")
 
         super(CSplitCIndexedList, self).__init__(*args)
 
@@ -323,9 +326,6 @@ def insert(self, i:int, item:str):
 
     def append(self, item:str):
         raise NotImplementedError("Method currently not supported")
-
-    def extend(self, arg:[list, set, ndarray]):
-        raise NotImplementedError("Method currently not supported")
 
 class IndexedLineDocument(object):
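
Note on the argument guard in the `fse/inputs.py` hunks: a check of this kind rejects extra positional arguments only when the exception is raised with `raise`. An exception that is merely constructed is discarded and execution continues. The sketch below shows the pattern on a hypothetical `GuardedList` class. It is a stand-in for illustration, not one of fse's input classes, and the second error message is likewise an assumption.

```python
class GuardedList:
    """Hypothetical stand-in for an indexed list that accepts one sequence."""

    def __init__(self, *args, custom_index):
        if len(args) > 1:
            # `raise` is required: a bare RuntimeError(...) expression
            # creates the exception object and then throws it away.
            raise RuntimeError("Argument merging not supported")
        self.items = list(args[0]) if args else []
        self.custom_index = list(custom_index)
        if len(self.items) != len(self.custom_index):
            raise RuntimeError("Size of custom_index does not match items")


# A single sequence with a matching index is accepted.
ok = GuardedList(["a", "b"], custom_index=[0, 1])

# Passing two sequences triggers the guard.
try:
    GuardedList(["a"], ["b"], custom_index=[0, 1])
    rejected = False
except RuntimeError:
    rejected = True
```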
 

diff --git a/fse/models/__init__.py b/fse/models/__init__.py
@@ -1,4 +1,5 @@
 from .average import Average
 from .sif import SIF
 from .usif import uSIF
+from .pooling import MaxPooling
 from .sentencevectors import SentenceVectors