Extremely fast vector scoring on ElasticSearch 6.4.x+ using vector embeddings.
About StaySense: StaySense is a revolutionary software company creating the most advanced marketing software ever made publicly available for Hospitality Managers in the Vacation Rental and Hotel Industries.
Company Website: http://staysense.com
This Plugin allows you to score Elasticsearch documents based on embedding-vectors, using dot-product or cosine-similarity. Note, this is a linear search approach in its current version. For very large data sets, this is likely not a good choice for realtime search queries.
Depending on cluster size this plugin will likely stop being performant around +1 Million docs.
ElasticSearch is working on some native vector functionality, however, their current solution uses Locality Sensitive Hashing (LSH) approaches. That is to say, their current beta releases all result in approximate results with false-negatives (not all matching records found). If you need exact results, this new feature from the ES team will not be useful to you.
We are working on a solution to close this false-negative gap common with LSH, however we are not communicating a timeline on the delivery of this solution at this time.
Until that time, if you have a large corpus, we would recommend checking out the FAISS (pronounced: Face) project.
- This plugin was ported from This elasticsearch 5.x vector scoring plugin and this discussion and lior-k's original contribution for ElasticSearch 5.5+ to achieve lightning fast result times when searching across millions of documents.
- This port is for ElasticSearch 6.4+ utilizing the ScoreScript class which was officially split from SearchScript and thus incompatible < 6.4.x
- lior-k's implementation had some confusing variable assignments that did not consistently match with Cosine-Sim's mathematical definition. This has been updated in the code to more accurately reflect the mathematical definition.
- Null pointer exceptions are now skipped (e.g. a document doesn't have a vector to compare against) allowing queries to complete successfully even in sparse datasets.
- Ported for latest version of ElasticSearch.
- Issues and Pull-Requests welcomed!
- Currently designed for Elasticsearch 6.4.x+
- Plugin is NOT backwards compatible (see note above about ScoreScript class)
- Will succesfully build for 6.4.0 and 6.4.1 (latest). Simply modify pom.xml with the correct version then follow maven build steps below.
- Clone the project
mvn package
to compile the plugin as a zip file- In Elasticsearch run
elasticsearch-plugin install file:/PATH_TO_ZIP
to install plugin
- Ultimately, by defining the field mapping as a binary value, by storing an embedded version of the vector you are able to take advantage of Lucene's direct API to achieve direct byte access without transformation.
- When creating the document, Lucene encodes the embedding directly to binary, making read access blazing fast on the search side.
- Does Lucene do the same with non-embedded vectors? Unsure, but the plugin supports that too if you want to store in [1.2934, -2.0349, ...., .039] format and try!
- Each document you score should have a field containing the base64 representation of your vector. for example:
{
"_id": 1,
....
"embeddedVector": "v7l48eAAAAA/s4VHwAAAAD+R7I5AAAAAv8MBMAAAAAA/yEI3AAAAAL/IWkeAAAAAv7s480AAAAC/v6DUgAAAAL+wJi0gAAAAP76VqUAAAAC/sL1ZYAAAAL/dyq/gAAAAP62FVcAAAAC/tQRvYAAAAL+j6ycAAAAAP6v1KcAAAAC/bN5hQAAAAL+u9ItAAAAAP4ckTsAAAAC/pmkjYAAAAD+cYpwAAAAAP5renEAAAAC/qY0HQAAAAD+wyYGgAAAAP5WrCcAAAAA/qzjTQAAAAD++LBzAAAAAP49wNKAAAAC/vu/aIAAAAD+hqXfAAAAAP4FfNCAAAAA/pjC64AAAAL+qwT2gAAAAv6S3OGAAAAC/gfMtgAAAAD/If5ZAAAAAP5mcXOAAAAC/xYAU4AAAAL+2nlfAAAAAP7sCXOAAAAA/petBIAAAAD9soYnAAAAAv5R7X+AAAAC/pgM/IAAAAL+ojI/gAAAAP2gPz2AAAAA/3FonoAAAAL/IHg1AAAAAv6p1SmAAAAA/tvKlQAAAAD/I2OMAAAAAP3FBiCAAAAA/wEd8IAAAAL94wI9AAAAAP2Y1IIAAAAA/rnS4wAAAAL9vriVgAAAAv1QxoCAAAAC/1/qu4AAAAL+inZFAAAAAv7aGA+AAAAA/lqYVYAAAAD+kNP0AAAAAP730BiAAAAA="
}
- Use this field mapping:
"embeddedVector": {
"type": "binary",
"doc_values": true
}
- The vector can be of any dimension
to convert an array of doubles to a base64 string we use these example methods:
Java
public static final String convertArrayToBase64(double[] array) {
final int capacity = 8 * array.length;
final ByteBuffer bb = ByteBuffer.allocate(capacity);
for (int i = 0; i < array.length; i++) {
bb.putDouble(array[i]);
}
bb.rewind();
final ByteBuffer encodedBB = Base64.getEncoder().encode(bb);
return new String(encodedBB.array());
}
public static double[] convertBase64ToArray(String base64Str) {
final byte[] decode = Base64.getDecoder().decode(base64Str.getBytes());
final DoubleBuffer doubleBuffer = ByteBuffer.wrap(decode).asDoubleBuffer();
final double[] dims = new double[doubleBuffer.capacity()];
doubleBuffer.get(dims);
return dims;
}
Python
import base64
import numpy as np
dbig = np.dtype('>f8')
def decode_float_list(base64_string):
bytes = base64.b64decode(base64_string)
return np.frombuffer(bytes, dtype=dbig).tolist()
def encode_array(arr):
base64_str = base64.b64encode(np.array(arr).astype(dbig)).decode("utf-8")
return base64_str
Go
import(
"math"
"encoding/binary"
"encoding/base64"
)
func convertArrayToBase64(array []float64) string {
bytes := make([]byte, 0, 8*len(array))
for _, a := range array {
bits := math.Float64bits(a)
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, bits)
bytes = append(bytes, b...)
}
encoded := base64.StdEncoding.EncodeToString(bytes)
return encoded
}
func convertBase64ToArray(base64Str string) ([]float64, error) {
decoded, err := base64.StdEncoding.DecodeString(base64Str)
if err != nil {
return nil, err
}
length := len(decoded)
array := make([]float64, 0, length/8)
for i := 0; i < len(decoded); i += 8 {
bits := binary.BigEndian.Uint64(decoded[i : i+8])
f := math.Float64frombits(bits)
array = append(array, f)
}
return array, nil
}
-
Query for documents based on their cosine similarity:
For ES 6.4.x:
{
"query": {
"function_score": {
"boost_mode" : "replace",
"functions": [
{
"script_score": {
"script": {
"source": "staysense",
"lang" : "fast_cosine",
"params": {
"field": "embeddedVector",
"cosine": true,
"encoded_vector" : "v+kopYAAAAA/wivkYAAAAD+wfJeAAAAAv8DL4QAAAAA/waYiwAAAAL+zAmvAAAAAv8c+aiAAAAC/07MyQAAAAL+ccr9AAAAAP9feCOAAAAC/y+ivYAAAAL/R34XgAAAAv+G8nuAAAAA/09hlwAAAAL/MkSWAAAAAP9EXn4AAAAC/zBBxYAAAAD/UY+3AAAAAP7zQSkAAAAC/zRijgAAAAA=="
}
}
}
}
]
}
}
}
- The example above shows a vector of 64 dimensions
- Parameters:
field
: The document field containing the base64 vector to compare against.cosine
: Boolean. if true - use cosine-similarity, else use dot-product.encoded_vector
: The encoded vector to compare to.
-
Query for documents based on their cosine similarity:
For ES 6.4.x:
{
"query": {
"function_score": {
"boost_mode" : "replace",
"functions": [
{
"script_score": {
"script": {
"source": "staysense",
"lang" : "fast_cosine",
"params": {
"field": "embeddedVector",
"cosine": true,
"vector" : [
-0.09217305481433868, 0.010635560378432274, -0.02878434956073761, ... , 0.08279753476381302
]
}
}
}
}
]
}
}
}
- The example above shows a vector of 64 dimensions
- Parameters:
field
: The document field containing the base64 vector to compare against.cosine
: Boolean. if true - use cosine-similarity, else use dot-product.vector
: The comma separated non-encoded vector to compare to.