Develop legacy #53

Open
wants to merge 65 commits into
base: master
fb42792
Update README.md
Feb 13, 2020
79fd3c9
Merge pull request #22 from oborchers/master
Feb 13, 2020
23a95ed
Merge branch 'develop' of https://github.com/oborchers/Fast_Sentence_…
Feb 13, 2020
fe9af44
Code style fix
Feb 13, 2020
db4c021
Added file closing op
Feb 14, 2020
296621a
Fixed lgtm test exclusion
Feb 14, 2020
48333b8
Fixed call to child in super __init__
Feb 14, 2020
8c21256
Added Pooling model
Feb 16, 2020
fdd9c9b
Added todo
Feb 16, 2020
d592410
Fixed n-to-m mapping in np
Feb 16, 2020
a533fa3
Small fix to avg
Feb 16, 2020
83a0c5a
MaxPool work NP code
Feb 16, 2020
5d6e885
Typing changes
Feb 16, 2020
00b33cb
Added pooling
Feb 16, 2020
976df44
Fixed non-negative bug + tests
Feb 16, 2020
fb1d55b
Updated readme
Feb 16, 2020
e8a7bd9
Corrected comments
Feb 16, 2020
5c94d5c
Black formatting
Feb 16, 2020
727efb2
Updated readme
Feb 18, 2020
4f936d8
Updated readme
Feb 18, 2020
cc31045
Added todos
Feb 18, 2020
1b144ac
First MaxPool Implementation
Feb 26, 2020
5feb9f3
Working MaxPooling w2v & ft
Feb 27, 2020
28ecf2f
Changed Date to 2020
Feb 27, 2020
947ebb2
Added GPL_v3
Feb 27, 2020
8a020bd
Added GPL v3
Feb 27, 2020
0f848cb
Changed setup_requires
Feb 27, 2020
902822f
black
Feb 27, 2020
66a9001
Changed setup
Feb 27, 2020
f8d664f
Reformat
Feb 27, 2020
3305b72
Refactoring
Feb 27, 2020
eb0d190
Fixed hpool base + added swrmax
Feb 28, 2020
c5291e1
Renaming i,j,k
Mar 2, 2020
e191b8d
Comments
Mar 2, 2020
14bab9d
Working np/cy w2v/ft pooling & unittests
Mar 2, 2020
65c94f8
Minor syntax changes
Mar 2, 2020
bebde9d
Working hpool + stride all models
Mar 2, 2020
c864bf3
Updated todo
Mar 2, 2020
5ad3a48
Minor
Mar 2, 2020
565ae0e
Updated Comments
Mar 2, 2020
08b7db3
Updated readme
Mar 2, 2020
25d95b0
Memory fix
Mar 3, 2020
f4aa40b
Readme
Mar 3, 2020
d84b225
Minor Fixes
May 22, 2020
b3e766e
todo
May 22, 2020
4fac670
Refactored unittests
May 22, 2020
7649820
black
May 22, 2020
ccebd28
Fixed shared imports
May 22, 2020
d74a138
fixed shared imports
Aug 4, 2020
d8633ab
added todo
Aug 4, 2020
77ab098
updated readme
Aug 4, 2020
158dd48
Changed error message
Aug 4, 2020
b36306d
Updated todo
Aug 4, 2020
1d01d1d
Updated todo average
Aug 4, 2020
f4f3fe5
Readme + todo
Aug 4, 2020
4660937
Updated mail
Aug 5, 2020
885805c
Merged VecsConfigs
Aug 5, 2020
67192c1
Bugfix
Aug 5, 2020
b680e83
Removed comments
Aug 5, 2020
68643a3
todo changed
Aug 5, 2020
9fb4883
First iterator draft
Aug 5, 2020
f858274
Added option to skip Cython tests
Aug 5, 2020
6d20af0
Python base_iterator + numpy pooling working
Aug 6, 2020
bc0e668
changed readme
Aug 6, 2020
ac17451
Last changes
Mar 17, 2021
1 change: 1 addition & 0 deletions .gitignore
@@ -46,6 +46,7 @@ Thumbs.db
legacy
latex
draft
drafts
fse.egg-info/

# Other #
61 changes: 61 additions & 0 deletions .gitignore.save
@@ -0,0 +1,61 @@
# Compiled source #
###################
*.com
*.class
*.dll
*.exe
*.o
*.so
*.pyc

# Packages #
############
# it's better to unpack these files and commit the raw source
# git has its own built in compression methods
*.7z
*.dmg
*.gz
*.iso
*.jar
*.rar
*.tar
*.zip

# Logs and databases #
######################
*.log
*.sql
*.sqlite
*.pkl
*.bak
*.npy
*.npz
*.code-workspace

# OS generated files #
######################
.DS_Store?
.DS_Store
ehthumbs.db
Icon?
Thumbs.db
*.icloud

# Folders #
###########
legacy

# Other #
#########
.ipynb_checkpoints/
.settings/
.vscode/
.eggs
.coverage
*.bak
/build/
/dist/
*.prof
*.lprof
*.bin
*.old
2 changes: 1 addition & 1 deletion .lgtm.yml
@@ -1,6 +1,6 @@
path_classifiers:
test:
- exclude: "**/test_*"
- test

extraction:
python:
4 changes: 3 additions & 1 deletion MANIFEST.in
@@ -4,4 +4,6 @@ include README.md
include fse/models/voidptr.h

include fse/models/average_inner.pyx
include fse/models/average_inner.pxd
include fse/models/average_inner.pxd

include fse/models/pooling_inner.pyx
101 changes: 58 additions & 43 deletions README.md
@@ -2,33 +2,33 @@
<a href="https://travis-ci.com/oborchers/Fast_Sentence_Embeddings"><img alt="Build Status" src="https://travis-ci.com/oborchers/Fast_Sentence_Embeddings.svg?branch=master"></a>
<a href="https://coveralls.io/github/oborchers/Fast_Sentence_Embeddings?branch=master"><img alt="Coverage Status" src="https://coveralls.io/repos/github/oborchers/Fast_Sentence_Embeddings/badge.svg?branch=master"></a>
<a href="https://pepy.tech/project/fse"><img alt="Downloads" src="https://pepy.tech/badge/fse"></a>
<a href="https://lgtm.com/projects/g/oborchers/Fast_Sentence_Embeddings/context:python"><img alt="Language grade: Python" src="https://img.shields.io/lgtm/grade/python/g/oborchers/Fast_Sentence_Embeddings.svg"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
<a href="https://img.shields.io/github/license/oborchers/Fast_Sentence_Embeddings.svg?style=flat"><img alt="License: GPL3" src="https://img.shields.io/github/license/oborchers/Fast_Sentence_Embeddings.svg?style=flat"></a>
</p>

Fast Sentence Embeddings (fse)
==================================

Fast Sentence Embeddings is a Python library that serves as an addition to Gensim. This library is intended to compute *sentence vectors* for large collections of sentences or documents.

If you want to support fse, take a quick [survey](https://forms.gle/8uSU323fWUVtVwcAA) to improve it :-)

*For additional features, please check out the dev branch. There is no development on the master branch*

Features
------------

Find the corresponding blog post(s) here:

- [Visualizing 100,000 Amazon Products](https://towardsdatascience.com/vis-amz-83dea6fcb059)
- [Sentence Embeddings. Fast, please!](https://towardsdatascience.com/fse-2b1ffa791cf9)

- **Announcement: Please understand that I am at the end of my PhD and do not have many free minutes to fix issues or add features.**

**fse** implements three algorithms for sentence embeddings. You can choose
between *unweighted sentence averages*, *smooth inverse frequency averages*, and *unsupervised smooth inverse frequency averages*.
**fse** implements multiple algorithms for sentence embeddings. You can choose
between *unweighted sentence averages*, *smooth inverse frequency averages*, *unsupervised smooth inverse frequency averages*, and *max pooling*. All models support hierarchical estimation, similar to convolutional filters in CNNs.
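
To make the pooling terminology concrete, here is a minimal NumPy sketch of max pooling and of one plausible reading of hierarchical pooling (average each window of consecutive words, then take the element-wise maximum over the window averages). The function names, the `window` and `stride` parameters, and the toy vectors are illustrative assumptions and are not taken from the fse code base.

```python
import numpy as np


def max_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Element-wise maximum over all word vectors of one sentence."""
    return word_vectors.max(axis=0)


def hierarchical_max_pool(
    word_vectors: np.ndarray, window: int = 2, stride: int = 1
) -> np.ndarray:
    """Average each window of consecutive words, then max-pool the averages."""
    n_words = word_vectors.shape[0]
    if n_words <= window:
        # Sentence fits into a single window: only one average exists.
        return word_vectors.mean(axis=0)
    starts = range(0, n_words - window + 1, stride)
    window_means = np.stack(
        [word_vectors[s : s + window].mean(axis=0) for s in starts]
    )
    return window_means.max(axis=0)


# Toy sentence with three 2-dimensional word vectors (made-up values).
sentence = np.array([[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]])
pooled = max_pool(sentence)
hier = hierarchical_max_pool(sentence, window=2, stride=1)
```

With a larger `stride`, fewer windows are averaged, which trades some local detail for speed; the actual fse implementation may handle short sentences and window boundaries differently.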

Key features of **fse** are:

**[X]** Up to 500.000 sentences / second (1)

**[X]** Supports Average, SIF, and uSIF Embeddings
**[X]** Supports Average, SIF, uSIF, and MaxPooling Embeddings

**[X]** All models can be estimated as hierarchical models (with window size and stride)

**[X]** Full support for Gensim's Word2Vec and all other compatible classes

@@ -50,20 +50,18 @@ Key features of **fse** are:

**[X]** Extensive documentation of all functions

**[X]** Extensive unittest for Linux/OSX

**[X]** Optimized Input Classes

(1) May vary significantly from system to system (e.g. when swap memory is used) and processing.
I regularly observe 300k-500k sentences/s for preprocessed data on my Macbook (2016).
Visit **Tutorial.ipynb** for an example.

Things I will work on next:

**[ ]** MaxPooling / Hierarchical Pooling Embedding

**[ ]** Approximate Nearest Neighbor Search for SentenceVectors


Find the corresponding blog post(s) here:

- [Visualizing 100,000 Amazon Products](https://towardsdatascience.com/vis-amz-83dea6fcb059) (Note: The code may be outdated)
- [Sentence Embeddings. Fast, please!](https://towardsdatascience.com/fse-2b1ffa791cf9) (Note: The code may be outdated)

Installation
------------
@@ -85,9 +83,30 @@ If building the Cython extension fails (you will be notified), try:

pip install -U git+https://github.com/oborchers/Fast_Sentence_Embeddings

To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi

Usage
-------------

In order to use **fse** you must first estimate a Gensim model which contains a
gensim.models.keyedvectors.BaseKeyedVectors class, for example
*Word2Vec* or *Fasttext*. Then you can proceed to compute sentence embeddings
for a corpus.

from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)

from fse.models import Average
from fse import IndexedList
model = Average(ft)
model.train(IndexedList(sentences))

model.sv.similarity(0,1)

fse offers multi-thread support out of the box. However, for most
applications a **single** *thread will most likely be sufficient*.
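
As a rough illustration of what the usage snippet computes, the sketch below averages toy word vectors per sentence and compares the two sentence vectors by cosine similarity. It assumes that `similarity` is a cosine similarity, as in Gensim; the helper names and the two-dimensional vectors are hypothetical and do not come from fse.

```python
import numpy as np


def average_embedding(tokens, vectors):
    """Unweighted mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the sentence has a vector")
    return np.mean(known, axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "meow": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "woof": np.array([0.0, 1.0]),
    "say": np.array([1.0, 1.0]),
}

s0 = average_embedding(["cat", "say", "meow"], toy_vectors)
s1 = average_embedding(["dog", "say", "woof"], toy_vectors)
sim = cosine_similarity(s0, s1)
```

The shared word "say" is the only source of overlap between the two toy sentences, so the similarity is positive but well below 1.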

Within the folder notebooks you can find the following guides:

**Tutorial.ipynb** offers a detailed walk-through of some of the most important functions fse has to offer.
@@ -107,34 +126,14 @@ The models presented are based on
- Deep-averaging embeddings [1]
- Smooth inverse frequency embeddings [2]
- Unsupervised smooth inverse frequency embeddings [3]
- MaxPooling / Hierarchical MaxPooling [5]
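
For readers unfamiliar with smooth inverse frequency (SIF) weighting, the following sketch shows the two ingredients as they are commonly described: each word is weighted by `a / (a + p(w))`, where `p(w)` is its relative corpus frequency, and the first singular direction is then removed from the matrix of sentence vectors. The function names, the default `a = 1e-3`, and the toy counts are assumptions for illustration; this is not the fse implementation, and uSIF differs in its details.

```python
import numpy as np


def sif_weights(word_counts, a=1e-3):
    """Map each word to a / (a + p(w)), with p(w) its relative frequency."""
    total = sum(word_counts.values())
    return {w: a / (a + c / total) for w, c in word_counts.items()}


def remove_first_component(sentence_vectors):
    """Subtract the projection onto the first right-singular vector."""
    _, _, vt = np.linalg.svd(sentence_vectors, full_matrices=False)
    u = vt[0]
    return sentence_vectors - np.outer(sentence_vectors @ u, u)


# Hypothetical corpus counts: frequent words receive smaller weights.
weights = sif_weights({"the": 90, "cat": 10})

# A rank-one matrix lies entirely along its first singular direction,
# so removing that direction leaves (numerically) nothing.
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
cleaned = remove_first_component(rank_one)
```

The weighting downplays very frequent words, while the component removal discards the direction shared by most sentences.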

Credits to Radim Řehůřek and all contributors for the **awesome** library
and code that [Gensim](https://github.com/RaRe-Technologies/gensim) provides. A whole lot of the code found in this lib is based on Gensim.

In order to use **fse** you must first estimate a Gensim model which contains a
gensim.models.keyedvectors.BaseKeyedVectors class, for example
*Word2Vec* or *Fasttext*. Then you can proceed to compute sentence embeddings
for a corpus.

from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)

from fse.models import Average
from fse import IndexedList
model = Average(ft)
model.train(IndexedList(sentences))

model.sv.similarity(0,1)

fse offers multi-thread support out of the box. However, for most
applications a *single thread will most likely be sufficient*.

To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi

Results
------------

Note: Though some models perform very well on the sentence similarity task (STS), this does not imply good performance in other downstream tasks!

Model | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
:---: | :---:
`CBOW-Paranmt` | **79.85**
@@ -156,6 +155,19 @@ Model | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
Changelog
-------------

0.1.16 from 0.1.15:
- Added Hierarchical (Convolutional) Embeddings for all Models
- Added MaxPooling
- Added Features to Sentencevectors
- Added further unittests
- Workaround for Numpy memmap issue (https://github.com/numpy/numpy/issues/13172)
- Bugfixes for python 3.8 builds
- Code refactoring to black style
- SVD ram subsampling for SIF / uSIF (customizable, standard is 1 GB of RAM)
- Minor fixes for nan-handling
- Minor fixes for sentencevectors class
- Changed License

0.1.15 from 0.1.11:
- Fixed major FT Ngram computation bug
- Rewrote the input class. Turns out NamedTuple was pretty slow.
@@ -181,13 +193,16 @@ Proceedings of the 3rd Workshop on Representation Learning for NLP. (Toulon, Fra

4. Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia Specia. Semeval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. Proceedings of SemEval 2017.

5. Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, Lawrence Carin (2018) Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. ACL 2018.


Copyright
-------------

Author: Oliver Borchers <borchers@bwl.uni-mannheim.de>
Author: Oliver Borchers <oliver-borchers@outlook.de>

Copyright (C) 2020 Oliver Borchers

Copyright (C) 2019 Oliver Borchers

Citation
-------------
@@ -197,7 +212,7 @@ If you found this software useful, please cite it in your publication.
@misc{Borchers2019,
author = {Borchers, Oliver},
title = {Fast sentence embeddings},
year = {2019},
year = {2020},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/oborchers/Fast_Sentence_Embeddings}},
22 changes: 11 additions & 11 deletions fse/inputs.py
@@ -1,8 +1,8 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Author: Oliver Borchers <borchers@bwl.uni-mannheim.de>
# Copyright (C) 2019 Oliver Borchers
# Author: Oliver Borchers <oliver-borchers@outlook.de>
# Copyright (C) 2020 Oliver Borchers

from typing import MutableSequence

@@ -150,6 +150,9 @@ def __init__(self, *args:[list, set, ndarray], custom_index:[list, ndarray]):
"""
self.custom_index = custom_index

if len(args) > 1:
raise RuntimeError("Argument merging not supported")

super(CIndexedList, self).__init__(*args)

if len(self.items) != len(self.custom_index):
@@ -176,9 +179,6 @@ def insert(self, i:int, item:str):

def append(self, item:str):
raise NotImplementedError("Method currently not supported")

def extend(self, arg:[list, set, ndarray]):
raise NotImplementedError("Method currently not supported")

class SplitIndexedList(BaseIndexedList):

@@ -220,6 +220,9 @@ def __init__(self, *args:[list, set, ndarray], custom_index:[list, ndarray]):
"""
self.custom_index = custom_index

if len(args) > 1:
raise RuntimeError("Argument merging not supported")

super(SplitCIndexedList, self).__init__(*args)

if len(self.items) != len(self.custom_index):
@@ -248,9 +251,6 @@ def insert(self, i:int, item:str):
def append(self, item:str):
raise NotImplementedError("Method currently not supported")

def extend(self, arg:[list, set, ndarray]):
raise NotImplementedError("Method currently not supported")

class CSplitIndexedList(BaseIndexedList):

def __init__(self, *args:[list, set, ndarray], custom_split:callable):
@@ -296,6 +296,9 @@ def __init__(self, *args:[list, set, ndarray], custom_split:callable, custom_ind
"""
self.custom_split = custom_split
self.custom_index = custom_index

if len(args) > 1:
raise RuntimeError("Argument merging not supported")

super(CSplitCIndexedList, self).__init__(*args)

@@ -323,9 +326,6 @@ def insert(self, i:int, item:str):

def append(self, item:str):
raise NotImplementedError("Method currently not supported")

def extend(self, arg:[list, set, ndarray]):
raise NotImplementedError("Method currently not supported")

class IndexedLineDocument(object):

1 change: 1 addition & 0 deletions fse/models/__init__.py
@@ -1,4 +1,5 @@
from .average import Average
from .sif import SIF
from .usif import uSIF
from .pooling import MaxPooling
from .sentencevectors import SentenceVectors