Skip to content

Fast cosine similarity (vector scoring) ElasticSearch 6.4+ Plugin

Notifications You must be signed in to change notification settings

MHXXX/fast-cosine-similarity

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 

Repository files navigation

alt text

StaySense - Fast Cosine Similarity ElasticSearch Plugin

Extremely fast vector scoring on ElasticSearch 6.4.x+ using vector embeddings.

About StaySense: StaySense is a revolutionary software company creating the most advanced marketing software ever made publicly available for Hospitality Managers in the Vacation Rental and Hotel Industries.

Company Website: http://staysense.com

Fast Elasticsearch Vector Scoring

This Plugin allows you to score Elasticsearch documents based on embedding-vectors, using dot-product or cosine-similarity. Note, this is a linear search approach in its current version. For very large data sets, this is likely not a good choice for realtime search queries.

Depending on cluster size this plugin will likely stop being performant around +1 Million docs.

ElasticSearch is working on some native vector functionality, however, their current solution uses Locality Sensitive Hashing (LSH) approaches. That is to say, their current beta releases all result in approximate results with false-negatives (not all matching records found). If you need exact results, this new feature from the ES team will not be useful to you.

We are working on a solution to close this false-negative gap common with LSH, however we are not communicating a timeline on the delivery of this solution at this time.

Until that time, if you have a large corpus, we would recommend checking out the FAISS (pronounced: Face) project.

General

  • This plugin was ported from This elasticsearch 5.x vector scoring plugin and this discussion and lior-k's original contribution for ElasticSearch 5.5+ to achieve lightning fast result times when searching across millions of documents.
  • This port is for ElasticSearch 6.4+ utilizing the ScoreScript class which was officially split from SearchScript and thus incompatible < 6.4.x

Improvements

  • lior-k's implementation had some confusing variable assignments that did not consistently match with Cosine-Sim's mathematical definition. This has been updated in the code to more accurately reflect the mathematical definition.
  • Null pointer exceptions are now skipped (e.g. a document doesn't have a vector to compare against) allowing queries to complete successfully even in sparse datasets.
  • Ported for latest version of ElasticSearch.
  • Issues and Pull-Requests welcomed!

Elasticsearch version

  • Currently designed for Elasticsearch 6.4.x+
  • Plugin is NOT backwards compatible (see note above about ScoreScript class)
  • Will succesfully build for 6.4.0 and 6.4.1 (latest). Simply modify pom.xml with the correct version then follow maven build steps below.

Maven Build Steps

  • Clone the project
  • mvn package to compile the plugin as a zip file
  • In Elasticsearch run elasticsearch-plugin install file:/PATH_TO_ZIP to install plugin

Why embeddings?

  • Ultimately, by defining the field mapping as a binary value, by storing an embedded version of the vector you are able to take advantage of Lucene's direct API to achieve direct byte access without transformation.
  • When creating the document, Lucene encodes the embedding directly to binary, making read access blazing fast on the search side.
  • Does Lucene do the same with non-embedded vectors? Unsure, but the plugin supports that too if you want to store in [1.2934, -2.0349, ...., .039] format and try!

Usage

Documents

  • Each document you score should have a field containing the base64 representation of your vector. for example:
   {
   	"_id": 1,
   	....
   	"embeddedVector": "v7l48eAAAAA/s4VHwAAAAD+R7I5AAAAAv8MBMAAAAAA/yEI3AAAAAL/IWkeAAAAAv7s480AAAAC/v6DUgAAAAL+wJi0gAAAAP76VqUAAAAC/sL1ZYAAAAL/dyq/gAAAAP62FVcAAAAC/tQRvYAAAAL+j6ycAAAAAP6v1KcAAAAC/bN5hQAAAAL+u9ItAAAAAP4ckTsAAAAC/pmkjYAAAAD+cYpwAAAAAP5renEAAAAC/qY0HQAAAAD+wyYGgAAAAP5WrCcAAAAA/qzjTQAAAAD++LBzAAAAAP49wNKAAAAC/vu/aIAAAAD+hqXfAAAAAP4FfNCAAAAA/pjC64AAAAL+qwT2gAAAAv6S3OGAAAAC/gfMtgAAAAD/If5ZAAAAAP5mcXOAAAAC/xYAU4AAAAL+2nlfAAAAAP7sCXOAAAAA/petBIAAAAD9soYnAAAAAv5R7X+AAAAC/pgM/IAAAAL+ojI/gAAAAP2gPz2AAAAA/3FonoAAAAL/IHg1AAAAAv6p1SmAAAAA/tvKlQAAAAD/I2OMAAAAAP3FBiCAAAAA/wEd8IAAAAL94wI9AAAAAP2Y1IIAAAAA/rnS4wAAAAL9vriVgAAAAv1QxoCAAAAC/1/qu4AAAAL+inZFAAAAAv7aGA+AAAAA/lqYVYAAAAD+kNP0AAAAAP730BiAAAAA="
   }
  • Use this field mapping:
      "embeddedVector": {
        "type": "binary",
        "doc_values": true
      }
  • The vector can be of any dimension

Converting a vector to Base64

to convert an array of doubles to a base64 string we use these example methods:

Java

public static final String convertArrayToBase64(double[] array) {
	final int capacity = 8 * array.length;
	final ByteBuffer bb = ByteBuffer.allocate(capacity);
	for (int i = 0; i < array.length; i++) {
		bb.putDouble(array[i]);
	}
	bb.rewind();
	final ByteBuffer encodedBB = Base64.getEncoder().encode(bb);
	return new String(encodedBB.array());
}

public static double[] convertBase64ToArray(String base64Str) {
	final byte[] decode = Base64.getDecoder().decode(base64Str.getBytes());
	final DoubleBuffer doubleBuffer = ByteBuffer.wrap(decode).asDoubleBuffer();

	final double[] dims = new double[doubleBuffer.capacity()];
	doubleBuffer.get(dims);
	return dims;
}

Python

import base64
import numpy as np

dbig = np.dtype('>f8')

def decode_float_list(base64_string):
    bytes = base64.b64decode(base64_string)
    return np.frombuffer(bytes, dtype=dbig).tolist()

def encode_array(arr):
    base64_str = base64.b64encode(np.array(arr).astype(dbig)).decode("utf-8")
    return base64_str

Go

import(
    "math"
    "encoding/binary"
    "encoding/base64"
)
func convertArrayToBase64(array []float64) string {
	bytes := make([]byte, 0, 8*len(array))
	for _, a := range array {
		bits := math.Float64bits(a)
		b := make([]byte, 8)
		binary.BigEndian.PutUint64(b, bits)
		bytes = append(bytes, b...)
	}
 	encoded := base64.StdEncoding.EncodeToString(bytes)
	return encoded
}
func convertBase64ToArray(base64Str string) ([]float64, error) {
	decoded, err := base64.StdEncoding.DecodeString(base64Str)
	if err != nil {
		return nil, err
	}
 	length := len(decoded)
	array := make([]float64, 0, length/8)
 	for i := 0; i < len(decoded); i += 8 {
		bits := binary.BigEndian.Uint64(decoded[i : i+8])
		f := math.Float64frombits(bits)
		array = append(array, f)
	}
	return array, nil
}

Querying

Querying with encodings

  • Query for documents based on their cosine similarity:

    For ES 6.4.x:

{
  "query": {
    "function_score": {
    "boost_mode" : "replace",
        "functions": [
          {
            "script_score": {
              "script": {
                  "source": "staysense",
                  "lang" : "fast_cosine",
                  "params": {
                      "field": "embeddedVector",
                      "cosine": true,
                      "encoded_vector" : "v+kopYAAAAA/wivkYAAAAD+wfJeAAAAAv8DL4QAAAAA/waYiwAAAAL+zAmvAAAAAv8c+aiAAAAC/07MyQAAAAL+ccr9AAAAAP9feCOAAAAC/y+ivYAAAAL/R34XgAAAAv+G8nuAAAAA/09hlwAAAAL/MkSWAAAAAP9EXn4AAAAC/zBBxYAAAAD/UY+3AAAAAP7zQSkAAAAC/zRijgAAAAA=="
                  }
              }
            }
          }
        ]
    }
  }
}
  • The example above shows a vector of 64 dimensions
  • Parameters:
    1. field: The document field containing the base64 vector to compare against.
    2. cosine: Boolean. if true - use cosine-similarity, else use dot-product.
    3. encoded_vector: The encoded vector to compare to.

Querying with vectors

  • Query for documents based on their cosine similarity:

    For ES 6.4.x:

{
  "query": {
    "function_score": {
    "boost_mode" : "replace",
        "functions": [
          {
            "script_score": {
              "script": {
                  "source": "staysense",
                  "lang" : "fast_cosine",
                  "params": {
                      "field": "embeddedVector",
                      "cosine": true,
                      "vector" : [
                      -0.09217305481433868, 0.010635560378432274, -0.02878434956073761, ... , 0.08279753476381302
                      ]
                  }
              }
            }
          }
        ]
    }
  }
}
  • The example above shows a vector of 64 dimensions
  • Parameters:
    1. field: The document field containing the base64 vector to compare against.
    2. cosine: Boolean. if true - use cosine-similarity, else use dot-product.
    3. vector: The comma separated non-encoded vector to compare to.

About

Fast cosine similarity (vector scoring) ElasticSearch 6.4+ Plugin

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%