fix: outdated readme and envd (#42)

Signed-off-by: usamoi <usamoi@outlook.com>
tensorchord · Aug 4, 2023 · f27d6a9 · f27d6a9
1 parent be11548
commit f27d6a9
Show file tree

Hide file tree

Showing 3 changed files with 172 additions and 67 deletions.
diff --git a/README.md b/README.md
@@ -16,84 +16,130 @@ pgvecto.rs is a Postgres extension that provides vector similarity search functi
 - 🦀 **Rewrite in Rust**: Rewriting in Rust offers benefits such as improved memory safety, better performance, and reduced **maintenance costs** over time.
 - 🙋 **Community**: People loves Rust We are happy to help you with any questions you may have. You could join our [Discord](https://discord.gg/KqswhpVgdU) to get in touch with us.
 
-## Installation from Source
+## Installation
 
 <details>
-  <summary>Build from Source</summary>
+  <summary>Build from source</summary>
 
 ### Install Rust and base dependency
+
 ```sh
-apt install -y build-essential libpq-dev libssl-dev pkg-config gcc libreadline-dev flex bison libxml2-dev libxslt-dev libxml2-utils xsltproc zlib1g-dev ccache clang
+sudo apt install -y build-essential libpq-dev libssl-dev pkg-config gcc libreadline-dev flex bison libxml2-dev libxslt-dev libxml2-utils xsltproc zlib1g-dev ccache clang git
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 ```
 
-### Install pgrx (tensorchord's fork)
+### Clone the Repository
+
+```sh
+git clone https://github.com/tensorchord/pgvecto.rs.git
+cd pgvecto.rs
+```
+
+### Install Postgresql and pgrx
+
 ```sh
+sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
+wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
+sudo apt-get update
+sudo apt-get -y install libpq-dev postgresql-15 postgresql-server-dev-15
 cargo install cargo-pgrx --git https://github.com/tensorchord/pgrx.git --rev $(cat Cargo.toml | grep "pgrx =" | awk -F'rev = "' '{print $2}' | cut -d'"' -f1)
-cargo pgrx init
+cargo pgrx init --pg15=/usr/lib/postgresql/15/bin/pg_config
 ```
 
-### Build the extension and config postgres
+### Install pgvecto.rs
+
 ```sh
 cargo pgrx install --release
-psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors"'
 ```
+
 You need restart your PostgreSQL server for the changes to take effect, like `systemctl restart postgresql.service`.
+
 </details>
 
+<details>
+  <summary>Install from release</summary>
+
+Download the deb package in the release page, and type `sudo apt install vectors-pg15-*.deb` to install the deb package.
+
+</details>
+
+Configure your PostgreSQL by modifying the `shared_preload_libraries` to include `vectors.so`.
+
+```sh
+psql -U postgres -c 'ALTER SYSTEM SET shared_preload_libraries = "vectors.so"'
+```
+
+You need restart the PostgreSQL cluster.
+
+```
+sudo systemctl restart postgresql.service
+```
 
-## Install the extension in postgres
+Connect to the database and enable the extension.
 
 ```sql
--- install the extension
 DROP EXTENSION IF EXISTS vectors;
 CREATE EXTENSION vectors;
--- check the extension related functions
-\df+
 ```
 
-## Get started with pgvecto.rs
+## Get started
+
+pgvecto.rs allows columns of a table to be defined as vectors.
+
+The data type `vector(n)` denotes an n-dimensional vector. The `n` within the brackets signifies the dimensions of the vector. For instance, `vector(1000)` would represent a vector with 1000 dimensions, so you could create a table like this.
+
+```sql
+-- create table with a vector column
+
+CREATE TABLE items (
+  id bigserial PRIMARY KEY,
+  embedding vector(3) NOT NULL
+);
+```
+
+You can then populate the table with vector data as follows.
+
+```sql
+-- insert values
+
+INSERT INTO items (embedding)
+VALUES ('[1,2,3]'), ('[4,5,6]');
+```
 
-We support three operators to calculate the distance between two vectors:
+We support three operators to calculate the distance between two vectors.
 
-- `<->`: square Euclidean distance
-- `<#>`: negative dot product distance
-- `<=>`: negative square cosine distance
+- `<->`: squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
+- `<#>`: negative dot product distance, defined as $- \Sigma x_iy_i$.
+- `<=>`: negative squared cosine distance, defined as $- \frac{(\Sigma x_iy_i)^2}{\Sigma x_i^2 \Sigma y_i^2}$.
 
 ```sql
 -- call the distance function through operators
 
--- square Euclidean distance
+-- squared Euclidean distance
 SELECT '[1, 2, 3]' <-> '[3, 2, 1]';
--- dot product distance
+-- negative dot product distance
 SELECT '[1, 2, 3]' <#> '[3, 2, 1]';
--- cosine distance
+-- negative square cosine distance
 SELECT '[1, 2, 3]' <=> '[3, 2, 1]';
 ```
 
-Note that, "square Euclidean distance" is defined as $ \Sigma (x_i - y_i) ^ 2 $, "negative dot product distance" is defined as $ - \Sigma x_iy_i $, and "negative square cosine distance" is defined as $ - \frac{(\Sigma x_iy_i)^2}{\Sigma x_i^2 \Sigma y_i^2} $, so that you can use `ORDER BY` to perform a KNN search directly without a `DESC` keyword.
-
-### Create a table
-
-You could use the `CREATE TABLE` statement to create a table with a vector column.
+You can search for a vector simply like this.
 
 ```sql
--- create table
-CREATE TABLE items (id bigserial PRIMARY KEY, emb vector(3));
--- insert values
-INSERT INTO items (emb) VALUES ('[1,2,3]'), ('[4,5,6]');
 -- query the similar embeddings
-SELECT * FROM items ORDER BY emb <-> '[3,2,1]' LIMIT 5;
+SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
 -- query the neighbors within a certain distance
-SELECT * FROM items WHERE emb <-> '[3,2,1]' < 5;
+SELECT * FROM items WHERE embedding <-> '[3,2,1]' < 5;
 ```
 
-### Create an index
+### Indexing
 
-You can create an index, using HNSW algorithm and square Euclidean distance with the following SQL.
+You can create an index, using squared Euclidean distance with the following SQL.
 
 ```sql
-CREATE INDEX ON train USING vectors (emb l2_ops)
+-- Using HNSW algorithm.
+
+CREATE INDEX ON items USING vectors (embedding l2_ops)
 WITH (options = $$
 capacity = 2097152
 size_ram = 4294967296
@@ -103,12 +149,10 @@ storage = "ram"
 m = 32
 ef = 256
 $$);
-```
 
-Or using IVFFlat algorithm.
+--- Or using IVFFlat algorithm.
 
-```sql
-CREATE INDEX ON train USING vectors (emb l2_ops)
+CREATE INDEX ON items USING vectors (embedding l2_ops)
 WITH (options = $$
 capacity = 2097152
 size_ram = 2147483648
@@ -120,22 +164,56 @@ nprobe = 10
 $$);
 ```
 
-The index must be built on a vector column. Failure to match the actual vector dimension with the dimension type modifier may result in an unsuccessful index building.
+Now you can perform a KNN search with the following SQL simply.
 
-The operator class determines the type of distance measurement to be used. At present, `l2_ops`, `dot_ops`, and `cosine_ops` are supported.
+```sql
+SELECT *, emb <-> '[0, 0, 0]' AS score
+FROM items
+ORDER BY embedding <-> '[0, 0, 0]' LIMIT 10;
+```
 
-You can specify the indexing and the vectors to be stored in the disk by setting `storage_vectors = "disk"`, and `storage = "disk"`. On this condition, `size_disk` must be specified.
+Please note, vector indexes are not loaded by default when PostgreSQL restarts. To load or unload the index, you can use `vectors_load` and `vectors_unload`.
 
-Now you can perform a KNN search with the following SQL simply.
+```sql
+--- get the index name
+\d items
 
-```SQL
-SELECT *, emb <-> '[0, 0, 0, 0]' AS score FROM items ORDER BY embedding <-> '[0, 0, 0, 0]' LIMIT 10;
+-- load the index
+SELECT vectors_load('items_embedding_idx'::regclass);
 ```
 
 We planning to support more index types ([issue here](https://github.com/tensorchord/pgvecto.rs/issues/17)).
 
 Welcome to contribute if you are also interested!
 
+## Reference
+
+### `vector` type
+
+`vector` and `vector(n)` are all legal data types, where `n` denotes dimensions of a vector.
+
+The current implementation ignores dimensions of a vector, i.e., the behavior is the same as for vectors of unspecified dimensions.
+
+There is only one exception: indexes cannot be created on columns without specified dimensions.
+
+### Indexing
+
+We utilize TOML syntax to express the index's configuration. Here's what each key in the configuration signifies:
+
+| Key                    | Type    | Description                                                                                                           |
+| ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------- |
+| capacity               | integer | The index's capacity. The value should be greater than the number of rows in your table.                              |
+| size_ram               | integer | (Optional) The maximum amount of memory the persisent part of index can occupy.                                       |
+| size_disk              | integer | (Optional) The maximum amount of disk-backed memory-mapped file size the persisent part of index can occupy.          |
+| storage_vectors        | string  | `ram` ensures that the vectors always stays in memory while `disk` suggests otherwise.                                |
+| algorithm.ivf          | table   | If this table is set, the IVF algorithm will be used for the index.                                                   |
+| algorithm.ivf.storage  | string  | (Optional) `ram` ensures that the persisent part of algorithm always stays in memory while `disk` suggests otherwise. |
+| algorithm.ivf.nlist    | integer | (Optional) Number of cluster units.                                                                                   |
+| algorithm.ivf.nprobe   | integer | (Optional) Number of units to query.                                                                                  |
+| algorithm.hnsw         | table   | If this table is set, the HNSW algorithm will be used for the index.                                                  |
+| algorithm.hnsw.storage | string  | (Optional) `ram` ensures that the persisent part of algorithm always stays in memory while `disk` suggests otherwise. |
+| algorithm.hnsw.m       | integer | (Optional) Maximum degree of the node.                                                                                |
+| algorithm.hnsw.ef      | integer | (Optional) Search scope in building.                                                                                  |
 
 ## Why not a specialty vector database?
 
@@ -148,7 +226,16 @@ Why not just use Postgres to do the vector similarity search? This is the reason
 UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
 
 -- Create an index on the embedding column
-CREATE INDEX ON documents USING vectors (embedding l2_ops) WITH (algorithm = "HNSW");
+CREATE INDEX ON documents USING vectors (embedding l2_ops)
+WITH (options = $$
+capacity = 2097152
+size_ram = 4294967296
+storage_vectors = "ram"
+[algorithm.hnsw]
+storage = "ram"
+m = 32
+ef = 256
+$$);
 
 -- Query the similar embeddings
 SELECT * FROM documents ORDER BY embedding <-> ai_embedding_vector('hello world') LIMIT 5;

diff --git a/build.envd b/build.envd
@@ -1,24 +1,8 @@
-# syntax=v1
-
-envdlib = include("https://github.com/tensorchord/envdlib")
-
 def build():
-    base(dev=True)
-    install.apt_packages(name=[
-        "clang",
-        "libreadline-dev",
-        "zlib1g-dev",
-        "flex",
-        "bison",
-        "libxslt-dev",
-        "libssl-dev",
-        "libxml2-utils",
-        "xsltproc",
-        "ccache",
-        "pkg-config",
-    ])
-    envdlib.rust()
-    run(commands=[
-        "cargo install cargo-pgrx --version 0.10.0-beta.1",
-        "cargo pgrx init",
-    ])
+    config.repo(url="https://github.com/tensorchord/pgvecto.rs")
+    base(os="ubuntu20.04", language="python3")
+    shell("zsh")
+    io.copy("./envd.sh", "/tmp/build/envd.sh")
+    io.copy("./rust-toolchain.toml", "/tmp/build/rust-toolchain.toml")
+    io.copy("./Cargo.toml", "/tmp/build/Cargo.toml")
+    run(commands=["cd /tmp/build", "sudo -u envd ./envd.sh"])
diff --git a/envd.sh b/envd.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/bash
+
+sudo apt-get update
+sudo apt-get install -y lsb-release
+sudo apt-get install -y gnupg
+echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" | sudo tee -a /etc/apt/sources.list.d/pgdg.list
+wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
+sudo apt-get update
+DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC sudo -E apt-get install tzdata
+sudo apt-get install -y build-essential
+sudo apt-get install -y libpq-dev
+sudo apt-get install -y libssl-dev
+sudo apt-get install -y pkg-config
+sudo apt-get install -y gcc
+sudo apt-get install -y libreadline-dev
+sudo apt-get install -y flex
+sudo apt-get install -y bison
+sudo apt-get install -y libxml2-dev
+sudo apt-get install -y libxslt-dev
+sudo apt-get install -y libxml2-utils
+sudo apt-get install -y xsltproc
+sudo apt-get install -y zlib1g-dev
+sudo apt-get install -y ccache
+sudo apt-get install -y clang
+sudo apt-get install -y git
+sudo apt-get install -y postgresql-15
+sudo apt-get install -y postgresql-server-dev-15
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+source "$HOME/.cargo/env"
+rev=$(cat Cargo.toml | grep "pgrx =" | awk -F 'rev = "' '{print $2}' | cut -d'"' -f1)
+cargo install cargo-pgrx --git https://github.com/tensorchord/pgrx.git --rev $rev
+cargo pgrx init --pg15=/usr/lib/postgresql/15/bin/pg_config
+sudo chmod 777 /usr/share/postgresql/15/extension/
+sudo chmod 777 /usr/lib/postgresql/15/lib/