Merge pull request #61 from dynatrace-oss/release-0.7.1

prepare 0.7.1 release
dynatrace-oss · Dec 13, 2022 · 55a137c
2 parents 97995ee + dde4229
commit 55a137c

Showing 72 changed files with 9,035 additions and 9,001 deletions.
diff --git a/README.md b/README.md
@@ -16,12 +16,12 @@ To add a dependency on hash4j using Maven, use the following:
 <dependency>
   <groupId>com.dynatrace.hash4j</groupId>
   <artifactId>hash4j</artifactId>
-  <version>0.7.0</version>
+  <version>0.7.1</version>
 </dependency>
 ```
 To add a dependency using Gradle:
 ```gradle
-implementation 'com.dynatrace.hash4j:hash4j:0.7.0'
+implementation 'com.dynatrace.hash4j:hash4j:0.7.1'
 ```
 
 ## Hash algorithms
@@ -100,21 +100,24 @@ the state size in bits multiplied by the squared relative standard error of the
 
 $\text{storage factor} := (\text{relative standard error})^2 \times (\text{state size})$.
 
-This library implements two algorithms:
+This library implements two algorithms for approximate distinct counting:
 * [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog): This implementation uses [6-bit registers](https://doi.org/10.1145/2452376.2452456) and
-an [improved distinct count estimator](https://arxiv.org/abs/1702.01284). Its asymptotic storage factor 
-is 6.477. The state size is a function of the precision parameter $p$, which defines the number of 
-registers as $m = 2^p$ and results in a state size of $6m = 6\cdot 2^p$ bits. Using the definition of the storage factor, the
-relative standard error is roughly $\sqrt{\frac{6.477}{6 m}} = \frac{1.039}{\sqrt{m}}$ as empirically confirmed by [simulation results](doc/hyperloglog-estimation-error.md).
+an [improved distinct count estimator](https://arxiv.org/abs/1702.01284). 
+Its asymptotic storage factor is $18 \ln 2 - 6 = 6.477$. The state size is $6m = 6\cdot 2^p$ bits, where the precision parameter $p$ also defines the number of registers as $m = 2^p$. Using the definition of the storage factor, the relative standard error is roughly $\sqrt{\frac{6.477}{6 m}} = \frac{1.039}{\sqrt{m}}$.
+In case of non-distributed data streams, the [martingale estimator](src/main/java/com/dynatrace/hash4j/distinctcount/MartingaleEstimator.java) can be used, 
+ which gives slightly better estimation results as the asymptotic storage factor is $6\ln 2 = 4.159$.
+This gives a relative standard error of $\sqrt{\frac{6\ln 2}{6m}} = \frac{0.833}{\sqrt{m}}$.
+The theoretically predicted estimation errors  have been empirically confirmed by [simulation results](doc/hyperloglog-estimation-error.md).
 * UltraLogLog: This is a new algorithm that will be described in detail in an upcoming paper. It has an
 asymptotic storage factor of 4.936, which corresponds to a 24% reduction compared to HyperLogLog.
 UltraLogLog uses 8-bit registers to enable fast random accesses and updates of the registers. Like for HyperLogLog,
-  the number of registers $m = 2^p$ depends on the chosen precision parameter $p$ and corresponds to the state size in bytes. The relative standard error 
-is approximately $\sqrt{\frac{4.936}{8 m}} = \frac{0.785}{\sqrt{m}}$ as confirmed by
-[simulation results](doc/ultraloglog-estimation-error.md).
-
+the number of registers $m = 2^p$ depends on the chosen precision parameter $p$ and corresponds to the state size in bytes.
+The relative standard error is approximately $\sqrt{\frac{4.936}{8 m}} = \frac{0.785}{\sqrt{m}}$. If the martingale estimator can 
+be used, the storage factor will be just $5 \ln 2 = 3.466$ yielding an asymptotic relative standard error of $\sqrt{\frac{5 \ln 2}{8 m}} = \frac{0.658}{\sqrt{m}}$.
+These theoretical formulas again agree well with the [simulation results](doc/ultraloglog-estimation-error.md).
+
 Both algorithms share the following properties:
-* Constant-time (HyperLogLog) & branch-free (UltraLogLog) add-operations
+* Constant-time add-operations
 * Allocation-free updates
 * Idempotency, adding items already inserted before will never change the internal state
 * Mergeability, even for data structures initialized with different precision parameters   
@@ -133,7 +136,12 @@ sketch.add(hasher.hashCharsToLong("foo"));
 
 double distinctCountEstimate = sketch.getDistinctCountEstimate(); // gives a value close to 2
 ```
-See also [UltraLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/UltraLogLogDemo.java).
+See also [UltraLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/UltraLogLogDemo.java) and [HyperLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/HyperLogLogDemo.java).
+
+### Compatibility
+HyperLogLog and UltraLogLog sketches can be reduced to corresponding sketches with smaller precision parameter `p` using `sketch.downsize(p)`. UltraLogLog sketches can be also transformed into HyperLogLog sketches with same precision parameter using `HyperLogLog hyperLogLog = HyperLogLog.create(ultraLogLog);` as demonstrated in [ConversionDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/ConversionDemo.java).
+HyperLogLog can be made compatible with implementations of other libraries which also use a single 64-bit hash value as input. The implementations usually differ only in which bits of the hash value are used for the register index and which bits are used to determine the number of leading (or trailing) zeros.
+Therefore, if the bits of the hash value are permuted accordingly, compatibility can be achieved.
 
 ## Contribution FAQ
 

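Note on the new Compatibility section of the README: the claim that permuting hash bits can reconcile two HyperLogLog implementations can be illustrated with a small sketch. Both conventions below are hypothetical and are not taken from hash4j or any other library: convention A uses the top `p` bits as register index and counts leading zeros of the rest, while convention B uses the low `p` bits as index and counts trailing zeros of the rest. All function names are made up for this illustration.

```python
import random


def index_and_rank_msb(h, p):
    """Hypothetical convention A: top p bits give the register index,
    rank = number of leading zeros of the remaining 64-p bits, plus 1."""
    rest_bits = 64 - p
    index = h >> rest_bits
    rest = h & ((1 << rest_bits) - 1)
    rank = rest_bits - rest.bit_length() + 1
    return index, rank


def index_and_rank_lsb(h, p):
    """Hypothetical convention B: low p bits give the register index,
    rank = number of trailing zeros of the remaining 64-p bits, plus 1."""
    rest_bits = 64 - p
    index = h & ((1 << p) - 1)
    rest = h >> p
    if rest == 0:
        return index, rest_bits + 1
    # rest & -rest isolates the lowest set bit; its bit_length is ntz + 1
    return index, (rest & -rest).bit_length()


def reverse_bits(value, width):
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def permute_lsb_to_msb(h, p):
    """Permute the bits of a 64-bit hash so that convention A applied to the
    result yields the same (index, rank) as convention B applied to h."""
    rest_bits = 64 - p
    index = h & ((1 << p) - 1)
    rest = h >> p
    return (index << rest_bits) | reverse_bits(rest, rest_bits)


def conventions_agree(p, trials=1000, seed=0):
    """Check the equivalence on pseudo-random 64-bit values."""
    rng = random.Random(seed)
    return all(
        index_and_rank_msb(permute_lsb_to_msb(h, p), p) == index_and_rank_lsb(h, p)
        for h in (rng.getrandbits(64) for _ in range(trials))
    )
```

Because every register update depends only on the `(index, rank)` pair, feeding the permuted hash into an implementation that follows convention A would produce the same register contents as convention B on the original hash, under the assumptions stated above.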
diff --git a/build.gradle b/build.gradle
@@ -66,7 +66,7 @@ java {
 }
 
 group = 'com.dynatrace.hash4j'
-version = '0.7.0'
+version = '0.7.1'
 
 spotless {
     ratchetFrom 'origin/main'

diff --git a/doc/hyperloglog-estimation-error.md b/doc/hyperloglog-estimation-error.md
@@ -1,6 +1,10 @@
 ### HyperLogLog estimation error
 
-The state of an HyperLogLog sketch with precision parameter $p$ requires $m = 0.75 \cdot 2^p$ bytes. The expected relative standard error is approximately given by
-$\frac{1.039}{\sqrt{m}}$. This is a good approximation for all $p\geq 6$ and large distinct counts. However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller. The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected. The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:
+The state of an HyperLogLog sketch with precision parameter $p$ requires $m = 0.75 \cdot 2^p$ bytes.
+The expected relative standard error is approximately given by $\frac{1.039}{\sqrt{m}}$ and $\frac{0.833}{\sqrt{m}}$ for the default and the martingale estimator, respectively.
+This is a good approximation for all $p\geq 6$ and large distinct counts.
+However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller.
+The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected.
+The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:
 
 <img src="../test-results/hyperloglog-estimation-error-p3.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p4.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p5.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p6.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p7.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p8.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p9.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p10.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p11.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p12.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p13.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p14.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p15.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p16.png" width="400">
diff --git a/doc/ultraloglog-estimation-error.md b/doc/ultraloglog-estimation-error.md
@@ -1,6 +1,10 @@
 ### UltraLogLog estimation error
 
-The state of an UltraLogLog sketch with precision parameter $p$ requires $m = 2^p$ bytes. The expected relative standard error is approximately given by
-$\frac{0.785}{\sqrt{m}}$. This is a good approximation for all $p\geq 6$ and large distinct counts. However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller. The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected. The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:
+The state of an UltraLogLog sketch with precision parameter $p$ requires $m = 2^p$ bytes.
+The expected relative standard error is approximately given by $\frac{0.785}{\sqrt{m}}$ and $\frac{0.658}{\sqrt{m}}$ for the default and the martingale estimator, respectively.
+This is a good approximation for all $p\geq 6$ and large distinct counts.
+However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller.
+The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected.
+The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:
 
 <img src="../test-results/ultraloglog-estimation-error-p3.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p4.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p5.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p6.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p7.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p8.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p9.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p10.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p11.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p12.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p13.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p14.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p15.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p16.png" width="400">
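Note on the error coefficients quoted in the README and in both estimation-error documents: they follow directly from the storage-factor definition, $\text{relative standard error} = \sqrt{\text{storage factor} / \text{state size in bits}}$. The following sketch recomputes them; the function and variable names are illustrative and not part of hash4j, and the storage factors are the ones stated in the README.

```python
import math


def error_coefficient(storage_factor, bits_per_register):
    """Return c such that the relative standard error is roughly c / sqrt(m),
    where m is the number of registers."""
    return math.sqrt(storage_factor / bits_per_register)


def relative_standard_error(coefficient, p):
    """Approximate relative standard error for precision parameter p (m = 2^p)."""
    m = 1 << p
    return coefficient / math.sqrt(m)


# Storage factors as stated in the README; register widths are 6 bits
# (HyperLogLog) and 8 bits (UltraLogLog).
COEFFICIENTS = {
    ("HyperLogLog", "default"): error_coefficient(18 * math.log(2) - 6, 6),
    ("HyperLogLog", "martingale"): error_coefficient(6 * math.log(2), 6),
    ("UltraLogLog", "default"): error_coefficient(4.936, 8),
    ("UltraLogLog", "martingale"): error_coefficient(5 * math.log(2), 8),
}

if __name__ == "__main__":
    for (sketch, estimator), c in COEFFICIENTS.items():
        print(f"{sketch:12s} {estimator:10s} c = {c:.3f}")
```

Rounded to three decimals, the coefficients come out as 1.039, 0.833, 0.785 and 0.658, matching the text. As a worked case, $p = 12$ gives $m = 4096$ and $\sqrt{m} = 64$, so the default HyperLogLog estimator has a relative standard error of roughly $1.039 / 64 \approx 1.62\%$.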
diff --git a/gradle/wrapper/gradle-wrapper.jar b/gradle/wrapper/gradle-wrapper.jar
diff --git a/gradle/wrapper/gradle-wrapper.properties b/gradle/wrapper/gradle-wrapper.properties
@@ -1,5 +1,6 @@
 distributionBase=GRADLE_USER_HOME
 distributionPath=wrapper/dists
-distributionUrl=https\://services.gradle.org/distributions/gradle-7.5.1-bin.zip
+distributionUrl=https\://services.gradle.org/distributions/gradle-7.6-bin.zip
+networkTimeout=10000
 zipStoreBase=GRADLE_USER_HOME
 zipStorePath=wrapper/dists
diff --git a/gradlew b/gradlew
@@ -55,7 +55,7 @@
 #       Darwin, MinGW, and NonStop.
 #
 #   (3) This script is generated from the Groovy template
-#       https://github.com/gradle/gradle/blob/master/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
+#       https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
 #       within the Gradle project.
 #
 #       You can find Gradle at https://github.com/gradle/gradle/.
@@ -80,10 +80,10 @@ do
     esac
 done
 
-APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit
-
-APP_NAME="Gradle"
+# This is normally unused
+# shellcheck disable=SC2034
 APP_BASE_NAME=${0##*/}
+APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit
 
 # Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
 DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
@@ -143,12 +143,16 @@ fi
 if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
     case $MAX_FD in #(
       max*)
+        # In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
+        # shellcheck disable=SC3045 
         MAX_FD=$( ulimit -H -n ) ||
             warn "Could not query maximum file descriptor limit"
     esac
     case $MAX_FD in  #(
       '' | soft) :;; #(
       *)
+        # In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
+        # shellcheck disable=SC3045 
         ulimit -n "$MAX_FD" ||
             warn "Could not set maximum file descriptor limit to $MAX_FD"
     esac

diff --git a/gradlew.bat b/gradlew.bat
@@ -26,6 +26,7 @@ if "%OS%"=="Windows_NT" setlocal
 
 set DIRNAME=%~dp0
 if "%DIRNAME%"=="" set DIRNAME=.
+@rem This is normally unused
 set APP_BASE_NAME=%~n0
 set APP_HOME=%DIRNAME%
 

diff --git a/python/estimation_error_evaluation.py b/python/estimation_error_evaluation.py
@@ -16,6 +16,7 @@
 import csv
 import matplotlib.pyplot as plt
 import glob
+from matplotlib.lines import Line2D
 
 
 def read_data(data_file):
@@ -58,11 +59,13 @@ def to_percent(values):
 def plot_charts(filename):
     d = read_data(filename)
 
+    colors = ["C0", "C1", "C2"]
+
     values = d[1]
     headers = d[0]
 
     fig, ax = plt.subplots(1, 1, sharey="row", sharex=True)
-    fig.set_size_inches(6, 3)
+    fig.set_size_inches(6, 4)
 
     p = int(headers["p"])
 
@@ -101,7 +104,7 @@ def plot_charts(filename):
         + num_simulation_runs_unit
     )
     ax.set_xscale("log", base=10)
-    theory = to_percent(values["theoretical relative standard error"])[0]
+    theory = to_percent(values["theoretical relative standard error default"])[0]
 
     if headers["sketch_name"] == "ultraloglog":
         ax.set_ylim([-theory * 0.1, theory * 1.15])
@@ -114,15 +117,60 @@ def plot_charts(filename):
     ax.set_xlabel("distinct count")
     ax.yaxis.grid(True)
     ax.set_ylabel("relative error (%)")
-    ax.plot(values["distinct count"], to_percent(values["relative bias"]), label="bias")
-    ax.plot(values["distinct count"], to_percent(values["relative rmse"]), label="rmse")
     ax.plot(
         values["distinct count"],
-        to_percent(values["theoretical relative standard error"]),
-        label="theory",
+        to_percent(values["theoretical relative standard error martingale"]),
+        label="theory (martingale)",
+        color=colors[2],
+        linestyle="dotted",
+    )
+    ax.plot(
+        values["distinct count"],
+        to_percent(values["theoretical relative standard error default"]),
+        label="theory (default)",
+        color=colors[2],
+    )
+    ax.plot(
+        values["distinct count"],
+        to_percent(values["relative rmse martingale"]),
+        label="rmse (martingale)",
+        color=colors[1],
+        linestyle="dotted",
+    )
+    ax.plot(
+        values["distinct count"],
+        to_percent(values["relative rmse default"]),
+        label="rmse (default)",
+        color=colors[1],
+    )
+    ax.plot(
+        values["distinct count"],
+        to_percent(values["relative bias martingale"]),
+        label="bias (martingale)",
+        color=colors[0],
+        linestyle="dotted",
+    )
+    ax.plot(
+        values["distinct count"],
+        to_percent(values["relative bias default"]),
+        label="bias (default)",
+        color=colors[0],
+    )
+
+    legend_elements = [
+        Line2D([0], [0], color=colors[0]),
+        Line2D([0], [0], color=colors[1]),
+        Line2D([0], [0], color=colors[2]),
+        Line2D([0], [0], color="gray"),
+        Line2D([0], [0], color="gray", linestyle="dotted"),
+    ]
+    fig.legend(
+        legend_elements,
+        ["bias", "rmse", "theory", "default", "martingale"],
+        loc="lower center",
+        ncol=5,
     )
-    # fig.legend(loc="center right")
-    ax.legend(loc="center right")
+    fig.subplots_adjust(top=0.93, bottom=0.21, left=0.11, right=0.99)
     fig.savefig(
         "test-results/"
         + headers["sketch_name"]
@@ -132,7 +180,6 @@ def plot_charts(filename):
         format="png",
         dpi=300,
         metadata={"creationDate": None},
-        bbox_inches="tight",
     )
     plt.close(fig)
 

diff --git a/src/main/java/com/dynatrace/hash4j/distinctcount/MartingaleEstimator.java b/src/main/java/com/dynatrace/hash4j/distinctcount/MartingaleEstimator.java
@@ -36,6 +36,9 @@
  * estimate as the standard estimator. However, if many further elements are added, the martingale
  * estimator may again produce better estimates.
  *
+ * <p>The estimator remains valid if the associated data structure is downsized with {@link
+ * HyperLogLog#downsize(int)} or {@link UltraLogLog#downsize(int)}.
+ *
  * <p>References:
  *
  * <ul>

diff --git a/src/main/java/com/dynatrace/hash4j/hashing/HashValue128.java b/src/main/java/com/dynatrace/hash4j/hashing/HashValue128.java
@@ -27,7 +27,7 @@ public HashValue128(long mostSignificantBits, long leastSignificantBits) {
   }
 
   public int getAsInt() {
-    return (int) leastSignificantBits;
+    return (int) getLeastSignificantBits();
   }
 
   public long getMostSignificantBits() {
@@ -39,7 +39,7 @@ public long getLeastSignificantBits() {
   }
 
   public long getAsLong() {
-    return leastSignificantBits;
+    return getLeastSignificantBits();
   }
 
   @Override