Commit

Merge pull request #61 from dynatrace-oss/release-0.7.1
prepare 0.7.1 release
oertl authored Dec 13, 2022
2 parents 97995ee + dde4229 commit 55a137c
Showing 72 changed files with 9,035 additions and 9,001 deletions.
34 changes: 21 additions & 13 deletions README.md
@@ -16,12 +16,12 @@ To add a dependency on hash4j using Maven, use the following:
<dependency>
    <groupId>com.dynatrace.hash4j</groupId>
    <artifactId>hash4j</artifactId>
    <version>0.7.0</version>
    <version>0.7.1</version>
</dependency>
```
To add a dependency using Gradle:
```gradle
implementation 'com.dynatrace.hash4j:hash4j:0.7.0'
implementation 'com.dynatrace.hash4j:hash4j:0.7.1'
```

## Hash algorithms
@@ -100,21 +100,24 @@ the state size in bits multiplied by the squared relative standard error of the

$\text{storage factor} := (\text{relative standard error})^2 \times (\text{state size})$.

This library implements two algorithms:
This library implements two algorithms for approximate distinct counting:
* [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog): This implementation uses [6-bit registers](https://doi.org/10.1145/2452376.2452456) and
an [improved distinct count estimator](https://arxiv.org/abs/1702.01284). Its asymptotic storage factor
is 6.477. The state size is a function of the precision parameter $p$, which defines the number of
registers as $m = 2^p$ and results in a state size of $6m = 6\cdot 2^p$ bits. Using the definition of the storage factor, the
relative standard error is roughly $\sqrt{\frac{6.477}{6 m}} = \frac{1.039}{\sqrt{m}}$ as empirically confirmed by [simulation results](doc/hyperloglog-estimation-error.md).
an [improved distinct count estimator](https://arxiv.org/abs/1702.01284).
Its asymptotic storage factor is $18 \ln 2 - 6 \approx 6.477$. The state size is $6m = 6\cdot 2^p$ bits, where the precision parameter $p$ defines the number of registers as $m = 2^p$. Using the definition of the storage factor, the relative standard error is roughly $\sqrt{\frac{6.477}{6 m}} \approx \frac{1.039}{\sqrt{m}}$.
For non-distributed data streams, the [martingale estimator](src/main/java/com/dynatrace/hash4j/distinctcount/MartingaleEstimator.java) can be used,
which gives slightly better estimates because its asymptotic storage factor is only $6\ln 2 \approx 4.159$.
This corresponds to a relative standard error of $\sqrt{\frac{6\ln 2}{6m}} \approx \frac{0.833}{\sqrt{m}}$.
The theoretically predicted estimation errors have been empirically confirmed by [simulation results](doc/hyperloglog-estimation-error.md).
* UltraLogLog: This is a new algorithm that will be described in detail in an upcoming paper. It has an
asymptotic storage factor of 4.936, which corresponds to a 24% reduction compared to HyperLogLog.
UltraLogLog uses 8-bit registers to enable fast random accesses and updates of the registers. As with HyperLogLog,
the number of registers $m = 2^p$ depends on the chosen precision parameter $p$ and corresponds to the state size in bytes. The relative standard error
is approximately $\sqrt{\frac{4.936}{8 m}} = \frac{0.785}{\sqrt{m}}$ as confirmed by
[simulation results](doc/ultraloglog-estimation-error.md).

the number of registers $m = 2^p$ depends on the chosen precision parameter $p$ and corresponds to the state size in bytes.
The relative standard error is approximately $\sqrt{\frac{4.936}{8 m}} \approx \frac{0.785}{\sqrt{m}}$. If the martingale estimator can
be used, the storage factor is only $5 \ln 2 \approx 3.466$, yielding an asymptotic relative standard error of $\sqrt{\frac{5 \ln 2}{8 m}} \approx \frac{0.658}{\sqrt{m}}$.
These theoretical formulas again agree well with the [simulation results](doc/ultraloglog-estimation-error.md).
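
The error formulas quoted for both algorithms follow from rearranging the storage-factor definition into $\text{relative standard error} = \sqrt{\text{storage factor} / \text{state size}}$. The following Python sketch merely reproduces that arithmetic; the function name `relative_standard_error` and the lookup tables are illustrative assumptions and not part of hash4j.

```python
import math

# Asymptotic storage factors as quoted in the README text.
STORAGE_FACTORS = {
    ("hyperloglog", "default"): 18 * math.log(2) - 6,  # ~6.477
    ("hyperloglog", "martingale"): 6 * math.log(2),    # ~4.159
    ("ultraloglog", "default"): 4.936,
    ("ultraloglog", "martingale"): 5 * math.log(2),    # ~3.466
}

# Register width in bits for each sketch type.
BITS_PER_REGISTER = {"hyperloglog": 6, "ultraloglog": 8}


def relative_standard_error(sketch, estimator, p):
    """Asymptotic relative standard error for precision parameter p."""
    m = 2 ** p  # number of registers
    state_size_bits = BITS_PER_REGISTER[sketch] * m
    return math.sqrt(STORAGE_FACTORS[(sketch, estimator)] / state_size_bits)


if __name__ == "__main__":
    p = 12
    for sketch, estimator in STORAGE_FACTORS:
        rse = relative_standard_error(sketch, estimator, p)
        print(f"{sketch:12s} {estimator:10s} p={p}: {100 * rse:.2f}%")
```

Multiplying any of the returned values by $\sqrt{m}$ recovers the constants 1.039, 0.833, 0.785, and 0.658 up to rounding.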

Both algorithms share the following properties:
* Constant-time (HyperLogLog) & branch-free (UltraLogLog) add-operations
* Constant-time add-operations
* Allocation-free updates
* Idempotency: adding an item that has already been inserted never changes the internal state
* Mergeability, even for data structures initialized with different precision parameters
@@ -133,7 +136,12 @@ sketch.add(hasher.hashCharsToLong("foo"));

double distinctCountEstimate = sketch.getDistinctCountEstimate(); // gives a value close to 2
```
See also [UltraLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/UltraLogLogDemo.java).
See also [UltraLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/UltraLogLogDemo.java) and [HyperLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/HyperLogLogDemo.java).

### Compatibility
HyperLogLog and UltraLogLog sketches can be reduced to corresponding sketches with a smaller precision parameter `p` using `sketch.downsize(p)`. UltraLogLog sketches can also be transformed into HyperLogLog sketches with the same precision parameter using `HyperLogLog hyperLogLog = HyperLogLog.create(ultraLogLog);` as demonstrated in [ConversionDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/ConversionDemo.java).
HyperLogLog can be made compatible with implementations of other libraries which also use a single 64-bit hash value as input. The implementations usually differ only in which bits of the hash value are used for the register index and which bits are used to determine the number of leading (or trailing) zeros.
Therefore, if the bits of the hash value are permuted accordingly, compatibility can be achieved.
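
To illustrate what permuting the hash bits can look like, the Python sketch below uses two made-up conventions. Hypothetical library A takes the register index from the top `p` bits and derives the register value from the leading zeros of the remaining low bits; hypothetical library B takes the index from the low `p` bits and uses the leading zeros of the upper bits. Neither convention is claimed to be the one used by hash4j or by any particular library. Under these assumptions, rotating the 64-bit hash left by `p` bits makes B see exactly what A sees.

```python
MASK64 = (1 << 64) - 1


def leading_zeros(value, width):
    """Number of leading zero bits of value, viewed as a width-bit integer."""
    return width - value.bit_length()


def update_a(h, p):
    """Hypothetical convention A: index = top p bits, rank from the low 64-p bits."""
    index = h >> (64 - p)
    rest = h & ((1 << (64 - p)) - 1)
    return index, leading_zeros(rest, 64 - p) + 1


def update_b(h, p):
    """Hypothetical convention B: index = low p bits, rank from the top 64-p bits."""
    index = h & ((1 << p) - 1)
    rest = h >> p
    return index, leading_zeros(rest, 64 - p) + 1


def rotate_left(h, k):
    """Rotate a 64-bit value left by k bits (0 < k < 64)."""
    return ((h << k) | (h >> (64 - k))) & MASK64


if __name__ == "__main__":
    h, p = 0x0123456789ABCDEF, 12
    # Both calls yield the same (register index, register value) pair.
    print(update_a(h, p), update_b(rotate_left(h, p), p))
```

For real libraries, the required permutation depends on their actual bit usage, which has to be taken from their source code or documentation.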

## Contribution FAQ

2 changes: 1 addition & 1 deletion build.gradle
@@ -66,7 +66,7 @@ java {
}

group = 'com.dynatrace.hash4j'
version = '0.7.0'
version = '0.7.1'

spotless {
ratchetFrom 'origin/main'
8 changes: 6 additions & 2 deletions doc/hyperloglog-estimation-error.md
@@ -1,6 +1,10 @@
### HyperLogLog estimation error

The state of a HyperLogLog sketch with precision parameter $p$ consists of $m = 2^p$ 6-bit registers and therefore requires $0.75 \cdot 2^p$ bytes. The expected relative standard error is approximately given by
$\frac{1.039}{\sqrt{m}}$. This is a good approximation for all $p\geq 6$ and large distinct counts. However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller. The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected. The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:
The state of a HyperLogLog sketch with precision parameter $p$ consists of $m = 2^p$ 6-bit registers and therefore requires $0.75 \cdot 2^p$ bytes.
The expected relative standard error is approximately given by $\frac{1.039}{\sqrt{m}}$ and $\frac{0.833}{\sqrt{m}}$ for the default and the martingale estimator, respectively.
This is a good approximation for all $p\geq 6$ and large distinct counts.
However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller.
The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected.
The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:

<img src="../test-results/hyperloglog-estimation-error-p3.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p4.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p5.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p6.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p7.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p8.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p9.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p10.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p11.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p12.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p13.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p14.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p15.png" width="400"><img src="../test-results/hyperloglog-estimation-error-p16.png" width="400">
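
As a usage note for the error formula, the inequality $\frac{c}{\sqrt{2^p}} \le \varepsilon$ can be solved for $p$ to pick the smallest precision that meets a target error $\varepsilon$, taking $m = 2^p$ as the number of registers as in the README. The sketch below is only an illustration: `min_precision` and `hyperloglog_state_bytes` are hypothetical helpers, not hash4j functions, and the result relies on the asymptotic approximation, which is stated to hold for $p \geq 6$ and large distinct counts.

```python
import math


def min_precision(target_rse, constant=1.039):
    """Smallest p with constant / sqrt(2**p) <= target_rse.

    constant is 1.039 for the default HyperLogLog estimator and 0.833
    for the martingale estimator (values quoted in the text).
    """
    if target_rse <= 0:
        raise ValueError("target_rse must be positive")
    # constant / sqrt(2**p) <= target  <=>  p >= 2 * log2(constant / target)
    p = math.ceil(2 * math.log2(constant / target_rse))
    return max(p, 0)


def hyperloglog_state_bytes(p):
    """State size of 2**p six-bit registers, rounded up to whole bytes."""
    return (6 * 2 ** p + 7) // 8


if __name__ == "__main__":
    for target in (0.05, 0.02, 0.01):
        p = min_precision(target)
        print(f"target {100 * target:.0f}%: p = {p}, "
              f"{hyperloglog_state_bytes(p)} bytes")
```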
8 changes: 6 additions & 2 deletions doc/ultraloglog-estimation-error.md
@@ -1,6 +1,10 @@
### UltraLogLog estimation error

The state of an UltraLogLog sketch with precision parameter $p$ requires $m = 2^p$ bytes. The expected relative standard error is approximately given by
$\frac{0.785}{\sqrt{m}}$. This is a good approximation for all $p\geq 6$ and large distinct counts. However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller. The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected. The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:
The state of an UltraLogLog sketch with precision parameter $p$ requires $m = 2^p$ bytes.
The expected relative standard error is approximately given by $\frac{0.785}{\sqrt{m}}$ and $\frac{0.658}{\sqrt{m}}$ for the default and the martingale estimator, respectively.
This is a good approximation for all $p\geq 6$ and large distinct counts.
However, the error is significantly smaller for distinct counts that are in the order of $m$ or smaller.
The bias is always much smaller than the root-mean-square error (rmse) and can therefore be neglected.
The following charts show the empirically evaluated relative error as a function of the true distinct count for various precision parameters $p$ based on 100k simulation runs:

<img src="../test-results/ultraloglog-estimation-error-p3.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p4.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p5.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p6.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p7.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p8.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p9.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p10.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p11.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p12.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p13.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p14.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p15.png" width="400"><img src="../test-results/ultraloglog-estimation-error-p16.png" width="400">
Binary file modified gradle/wrapper/gradle-wrapper.jar
Binary file not shown.
3 changes: 2 additions & 1 deletion gradle/wrapper/gradle-wrapper.properties
@@ -1,5 +1,6 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.5.1-bin.zip
distributionUrl=https\://services.gradle.org/distributions/gradle-7.6-bin.zip
networkTimeout=10000
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
12 changes: 8 additions & 4 deletions gradlew
@@ -55,7 +55,7 @@
# Darwin, MinGW, and NonStop.
#
# (3) This script is generated from the Groovy template
# https://github.com/gradle/gradle/blob/master/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
# https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
# within the Gradle project.
#
# You can find Gradle at https://github.com/gradle/gradle/.
@@ -80,10 +80,10 @@ do
esac
done

APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit

APP_NAME="Gradle"
# This is normally unused
# shellcheck disable=SC2034
APP_BASE_NAME=${0##*/}
APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit

# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
@@ -143,12 +143,16 @@ fi
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
    case $MAX_FD in #(
      max*)
        # In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
        # shellcheck disable=SC3045
        MAX_FD=$( ulimit -H -n ) ||
            warn "Could not query maximum file descriptor limit"
    esac
    case $MAX_FD in #(
      '' | soft) :;; #(
      *)
        # In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
        # shellcheck disable=SC3045
        ulimit -n "$MAX_FD" ||
            warn "Could not set maximum file descriptor limit to $MAX_FD"
    esac
1 change: 1 addition & 0 deletions gradlew.bat
@@ -26,6 +26,7 @@ if "%OS%"=="Windows_NT" setlocal

set DIRNAME=%~dp0
if "%DIRNAME%"=="" set DIRNAME=.
@rem This is normally unused
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%

65 changes: 56 additions & 9 deletions python/estimation_error_evaluation.py
@@ -16,6 +16,7 @@
import csv
import matplotlib.pyplot as plt
import glob
from matplotlib.lines import Line2D


def read_data(data_file):
@@ -58,11 +59,13 @@ def to_percent(values):
def plot_charts(filename):
    d = read_data(filename)

    colors = ["C0", "C1", "C2"]

    values = d[1]
    headers = d[0]

    fig, ax = plt.subplots(1, 1, sharey="row", sharex=True)
    fig.set_size_inches(6, 3)
    fig.set_size_inches(6, 4)

    p = int(headers["p"])

@@ -101,7 +104,7 @@ def plot_charts(filename):
        + num_simulation_runs_unit
    )
    ax.set_xscale("log", base=10)
    theory = to_percent(values["theoretical relative standard error"])[0]
    theory = to_percent(values["theoretical relative standard error default"])[0]

    if headers["sketch_name"] == "ultraloglog":
        ax.set_ylim([-theory * 0.1, theory * 1.15])
@@ -114,15 +117,60 @@ def plot_charts(filename):
    ax.set_xlabel("distinct count")
    ax.yaxis.grid(True)
    ax.set_ylabel("relative error (%)")
    ax.plot(values["distinct count"], to_percent(values["relative bias"]), label="bias")
    ax.plot(values["distinct count"], to_percent(values["relative rmse"]), label="rmse")
    ax.plot(
        values["distinct count"],
        to_percent(values["theoretical relative standard error"]),
        label="theory",
        to_percent(values["theoretical relative standard error martingale"]),
        label="theory (martingale)",
        color=colors[2],
        linestyle="dotted",
    )
    ax.plot(
        values["distinct count"],
        to_percent(values["theoretical relative standard error default"]),
        label="theory (default)",
        color=colors[2],
    )
    ax.plot(
        values["distinct count"],
        to_percent(values["relative rmse martingale"]),
        label="rmse (martingale)",
        color=colors[1],
        linestyle="dotted",
    )
    ax.plot(
        values["distinct count"],
        to_percent(values["relative rmse default"]),
        label="rmse (default)",
        color=colors[1],
    )
    ax.plot(
        values["distinct count"],
        to_percent(values["relative bias martingale"]),
        label="bias (martingale)",
        color=colors[0],
        linestyle="dotted",
    )
    ax.plot(
        values["distinct count"],
        to_percent(values["relative bias default"]),
        label="bias (default)",
        color=colors[0],
    )

    legend_elements = [
        Line2D([0], [0], color=colors[0]),
        Line2D([0], [0], color=colors[1]),
        Line2D([0], [0], color=colors[2]),
        Line2D([0], [0], color="gray"),
        Line2D([0], [0], color="gray", linestyle="dotted"),
    ]
    fig.legend(
        legend_elements,
        ["bias", "rmse", "theory", "default", "martingale"],
        loc="lower center",
        ncol=5,
    )
    # fig.legend(loc="center right")
    ax.legend(loc="center right")
    fig.subplots_adjust(top=0.93, bottom=0.21, left=0.11, right=0.99)
    fig.savefig(
        "test-results/"
        + headers["sketch_name"]
@@ -132,7 +180,6 @@ def plot_charts(filename):
        format="png",
        dpi=300,
        metadata={"creationDate": None},
        bbox_inches="tight",
    )
    plt.close(fig)

@@ -36,6 +36,9 @@
* estimate as the standard estimator. However, if many further elements are added, the martingale
* estimator may again produce better estimates.
*
* <p>The estimator remains valid if the associated data structure is downsized with {@link
* HyperLogLog#downsize(int)} or {@link UltraLogLog#downsize(int)}.
*
* <p>References:
*
* <ul>
4 changes: 2 additions & 2 deletions src/main/java/com/dynatrace/hash4j/hashing/HashValue128.java
@@ -27,7 +27,7 @@ public HashValue128(long mostSignificantBits, long leastSignificantBits) {
  }

  public int getAsInt() {
    return (int) leastSignificantBits;
    return (int) getLeastSignificantBits();
  }

  public long getMostSignificantBits() {
@@ -39,7 +39,7 @@ public long getLeastSignificantBits() {
  }

  public long getAsLong() {
    return leastSignificantBits;
    return getLeastSignificantBits();
  }

  @Override