gguf-hash: model wide and per tensor hashing using xxhash and sha1 #8048
Conversation
There is also the option to use existing hash utilities to hash the GGUF data. For example, something like:

# skip the GGUF header
dd bs=1 skip=$(gguf-dump --data-offset model.gguf) if=model.gguf | sha256sum

Would that work? |
@ggerganov I gave your approach a shot in #8054 (PR to add --data-offset and --data-alignment) and it does work, but your initial suggestion of setting bs=1 and using skip=X was very slow. It turns out you should set bs=X and skip=1 instead.

$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha1sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 4.32916 s, 527 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c -
real 0m7.200s
user 0m6.797s
sys 0m1.326s
$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha256sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 9.95004 s, 229 MB/s
8b5eea25e2946b05e345dc0e1dea191968bd2ebc6a15cb321085391dc89d9692 -
real 0m13.016s
user 0m12.744s
sys 0m1.509s

I think GG's approach is valid, and it will be faster as long as these assumptions hold (so we could use it for internal CI tests, since it would be obvious if it breaks due to GGUF file format evolution). However, you may still want to keep this PR if you want to support per-tensor hash checks. I would also like to develop a consistent way to identify GGUF models by their model tensors (even if the KV metadata changes). |
I attempted to add sha256 to gguf-hash.c, but for some reason it just doesn't want to work, so I abandoned that approach. Anyway, I've added UUIDv5 model ID generation to the C implementation (using uuid.uuid5(uuid.NAMESPACE_URL, 'en.wikipedia.org/wiki/Llama.cpp') --> "ef001206-dadc-5f6d-a15f-3359e577d4e5" as the UUIDv5 namespace) and made sure it matches the Python implementation. This was relatively easy, as I already had sha1 working in gguf-hash.c. So now we have a consistent way of generating a UUIDv5 based on the GGUF tensor content if we choose to do so. Below is how I checked that both generate the same UUIDv5.
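As a sanity check of the namespace described above, here is a small Python sketch. The `uuid_from_sha1_digest` helper is an illustrative reconstruction of how a v5 UUID embeds a SHA-1 digest (per RFC 4122), not the PR's exact code:

```python
import hashlib
import uuid

# Namespace derived exactly as described in the comment above
LLAMA_CPP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, 'en.wikipedia.org/wiki/Llama.cpp')
print(LLAMA_CPP_NAMESPACE)  # ef001206-dadc-5f6d-a15f-3359e577d4e5

def uuid_from_sha1_digest(digest: bytes) -> uuid.UUID:
    # An RFC 4122 v5 UUID is the first 16 bytes of a SHA-1 digest with the
    # version and variant bits overridden.
    b = bytearray(digest[:16])
    b[6] = (b[6] & 0x0F) | 0x50   # version 5
    b[8] = (b[8] & 0x3F) | 0x80   # RFC 4122 variant
    return uuid.UUID(bytes=bytes(b))
```

This is why having a working sha1 in gguf-hash.c makes the UUIDv5 easy: the UUID is just the (already computed) SHA-1 of namespace-plus-content with a handful of bits forced.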
Anyway, this PR is now considered operational. |
Unsure what the issue is with the makefile in the Windows context... |
@mofosyne The problem is with https://github.com/ggerganov/llama.cpp/actions/runs/9632516256/job/26565799805?pr=8048#step:7:80
|
@compilade That's pretty strange... so basically Visual Studio doesn't support all C11 features? These are the checks in xxhash.h:

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* >= C11 */
# include <stdalign.h>
# define XXH_ALIGN(n) alignas(n)
#elif defined(__cplusplus) && (__cplusplus >= 201103L) /* >= C++11 */
/* In C++ alignas() is a keyword */
# define XXH_ALIGN(n) alignas(n)
#elif defined(__GNUC__)
# define XXH_ALIGN(n) __attribute__ ((aligned(n)))
#elif defined(_MSC_VER)
# define XXH_ALIGN(n) __declspec(align(n))
#else
# define XXH_ALIGN(n) /* disabled */
#endif

edit: It turns out Windows, at least on the windows-2019 GitHub runner (unsure if fixed on newer runners), is lying about its support for the C11 standard, as explained in google-deepmind/mujoco#862. They had to do a workaround in google-deepmind/mujoco@ac6663f. It would be interesting to see if a newer Windows build works better... should we update the GitHub runner to the latest Windows version? (Pushing a commit to test the idea) |
@compilade finally! Thanks for your assistance here. I did some sanity checking against the gguf-dump --data-offset feature that was recently added. Overall, the checksum for the whole tensor array in this test via ~/gitextern/llama.cpp/llama-gguf-hash is:
Which matches the gguf-dump approach:

$ dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf status=none | xxhsum
818489b2138f418f stdin
$ dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf status=none | sha1sum
32ea6e22a0c63beef6ce2ba15471689b8144b39c -
$ dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf status=none | sha256sum
8b5eea25e2946b05e345dc0e1dea191968bd2ebc6a15cb321085391dc89d9692 -

Also, cross-checking that the Python implementation matches the C approach for UUID generation (and also sha1):

$ ~/gitextern/llama.cpp/llama-gguf-hash --sha1 phi-2.Q6_K.gguf
...
sha1 32ea6e22a0c63beef6ce2ba15471689b8144b39c phi-2.Q6_K.gguf
$~/gitextern/llama.cpp/llama-gguf-hash --uuid phi-2.Q6_K.gguf
UUIDv5 15608c46-42f1-50ae-b98f-04c394f6806f phi-2.Q6_K.gguf
$~/gitextern/llama.cpp/gguf-py/scripts/gguf-hash.py phi-2.Q6_K.gguf
...
sha1 32ea6e22a0c63beef6ce2ba15471689b8144b39c phi-2.Q6_K.gguf
UUIDv5 15608c46-42f1-50ae-b98f-04c394f6806f phi-2.Q6_K.gguf

So this PR is now ready for review. |
Validation has been added now; this is how you may use it. A manifest file can contain multiple different gguf files, which may be useful if you are creating a CI/CD validation file and don't want to pollute the repo with too many test files. If a gguf file you are validating doesn't match, the program will return an error exit code. Also note that, unlike sha256sum, this tool doesn't support checking for the presence of other gguf files. In fact, I think I want a better word than "manifest", considering this file is more of a 'database of hashes' than a record of all files we are expecting in an archive.

Generate manifest

To generate, we may use this command:

./llama-gguf-hash --all test.gguf > test.gguf.manifest

This generates a manifest like the one below, which contains multiple hash types as well as per-tensor-layer hashes:

xxh64 f66e9cd66a4396a0 test.gguf:tensor_0
sha1 59f79ecefd8125a996fdf419239051a7e99e5f20 test.gguf:tensor_0
sha256 c0510d38fa060c46265e0160a85c7243096b01dd31c2f355bdbb5516b20de1bd test.gguf:tensor_0
xxh64 7d3a1f9ac04d0537 test.gguf:tensor_1
sha1 4765f592eacf096df4628ba59476af94d767080a test.gguf:tensor_1
sha256 8514cbcc73692a2c56bd7a33a022edd5ff819614bd23b19915d7224387f397a7 test.gguf:tensor_1
xxh64 a0af5d700049693b test.gguf:tensor_2
sha1 25cbfbad4513cc348e2c95ebdee69d6ff2fd8753 test.gguf:tensor_2
sha256 947e6b36e20f2cc95e1d2ce1c1669d813d574657ac6b5ac5196158d454d35180 test.gguf:tensor_2
xxh64 e83fddf559d7b6a6 test.gguf:tensor_3
sha1 a9cba73e2d90f2ee3dae2548caa42bef3fe6a96c test.gguf:tensor_3
sha256 423b044e016d8ac73c39f23f60bf01bedef5ecb03c0230accd824c91fe86f1a1 test.gguf:tensor_3
xxh64 1257733306b7992d test.gguf:tensor_4
sha1 d7bc61db93bb685ce9d598da89717c66729b7543 test.gguf:tensor_4
sha256 79737cb3912d4201384cf7f16a1a37ff7823f23ea796cb205b6ca361ab9e3ebf test.gguf:tensor_4
xxh64 d238d16ba4711e58 test.gguf:tensor_5
sha1 0706566c198fe1072f37e0a5135b4b5f23654c52 test.gguf:tensor_5
sha256 60949be8298eced0ecdde64487643d018407bd261691e061d9e9c3dbc9fd358b test.gguf:tensor_5
xxh64 3fbc3b65ab8c7f39 test.gguf:tensor_6
sha1 73922a0727226a409049f6fc3172a52219ca6f00 test.gguf:tensor_6
sha256 574f4c46ff384a3b9a225eb955d2a871847a2e8b3fa59387a8252832e92ef7b0 test.gguf:tensor_6
xxh64 c22021c29854f093 test.gguf:tensor_7
sha1 efc39cece6a951188fc41e354c73bbfe6813d447 test.gguf:tensor_7
sha256 4c0410cd3c500f078ae5b21e8dc9eb79e29112713b2ab58a882f82a3868d4d75 test.gguf:tensor_7
xxh64 936df61f5d64261f test.gguf:tensor_8
sha1 c2490296d789a4f34398a337fed8377d943d9f06 test.gguf:tensor_8
sha256 c4401313feeba0261275c3b25bd2d8fe40ce04e0f440c2980ed0e9674c30ff01 test.gguf:tensor_8
xxh64 93fd20c64421c081 test.gguf:tensor_9
sha1 7047ce1e78437a6884337a3751c7ee0421918a65 test.gguf:tensor_9
sha256 23d57cf0d7a6e90b0b3616b41300e0cd354781e812add854a5f95aa55f2bc514 test.gguf:tensor_9
xxh64 5a54d3aad816f302 test.gguf
sha1 d15be52c4ff213e823cb6dd13af7ee2f978e7042 test.gguf
sha256 7dd641b32f59b60dbd4b5420c4b0f6321ccf48f58f6ae201a3dbc4a58a27c6e4 test.gguf

Validation

Below are some examples of different validations you can do with a manifest.

Check using strongest hash

We can then use the normal check command, which by default checks against the highest-security-strength hash and verifies against that:

$ ./llama-gguf-hash --check test.gguf.manifest test.gguf
manifest test.gguf.manifest sha256 sha1 xxh64
sha256 c0510d38fa060c46265e0160a85c7243096b01dd31c2f355bdbb5516b20de1bd test.gguf:tensor_0 - Ok
sha256 8514cbcc73692a2c56bd7a33a022edd5ff819614bd23b19915d7224387f397a7 test.gguf:tensor_1 - Ok
sha256 947e6b36e20f2cc95e1d2ce1c1669d813d574657ac6b5ac5196158d454d35180 test.gguf:tensor_2 - Ok
sha256 423b044e016d8ac73c39f23f60bf01bedef5ecb03c0230accd824c91fe86f1a1 test.gguf:tensor_3 - Ok
sha256 79737cb3912d4201384cf7f16a1a37ff7823f23ea796cb205b6ca361ab9e3ebf test.gguf:tensor_4 - Ok
sha256 60949be8298eced0ecdde64487643d018407bd261691e061d9e9c3dbc9fd358b test.gguf:tensor_5 - Ok
sha256 574f4c46ff384a3b9a225eb955d2a871847a2e8b3fa59387a8252832e92ef7b0 test.gguf:tensor_6 - Ok
sha256 4c0410cd3c500f078ae5b21e8dc9eb79e29112713b2ab58a882f82a3868d4d75 test.gguf:tensor_7 - Ok
sha256 c4401313feeba0261275c3b25bd2d8fe40ce04e0f440c2980ed0e9674c30ff01 test.gguf:tensor_8 - Ok
sha256 23d57cf0d7a6e90b0b3616b41300e0cd354781e812add854a5f95aa55f2bc514 test.gguf:tensor_9 - Ok
sha256 7dd641b32f59b60dbd4b5420c4b0f6321ccf48f58f6ae201a3dbc4a58a27c6e4 test.gguf - Ok
Verification results for test.gguf.manifest - Success

Check using fastest hash

Or we may explicitly ask for a faster hash:

$ ./llama-gguf-hash --check test.gguf.manifest --xxh64 test.gguf
manifest test.gguf.manifest sha256 sha1 xxh64
xxh64 f66e9cd66a4396a0 test.gguf:tensor_0 - Ok
xxh64 7d3a1f9ac04d0537 test.gguf:tensor_1 - Ok
xxh64 a0af5d700049693b test.gguf:tensor_2 - Ok
xxh64 e83fddf559d7b6a6 test.gguf:tensor_3 - Ok
xxh64 1257733306b7992d test.gguf:tensor_4 - Ok
xxh64 d238d16ba4711e58 test.gguf:tensor_5 - Ok
xxh64 3fbc3b65ab8c7f39 test.gguf:tensor_6 - Ok
xxh64 c22021c29854f093 test.gguf:tensor_7 - Ok
xxh64 936df61f5d64261f test.gguf:tensor_8 - Ok
xxh64 93fd20c64421c081 test.gguf:tensor_9 - Ok
xxh64 5a54d3aad816f302 test.gguf - Ok
Verification results for test.gguf.manifest - Success

Check using all hashes

Or maybe we want to check that every hash is valid:

$ ./llama-gguf-hash --check test.gguf.manifest --all test.gguf
manifest test.gguf.manifest sha256 sha1 xxh64
xxh64 f66e9cd66a4396a0 test.gguf:tensor_0 - Ok
sha1 59f79ecefd8125a996fdf419239051a7e99e5f20 test.gguf:tensor_0 - Ok
sha256 c0510d38fa060c46265e0160a85c7243096b01dd31c2f355bdbb5516b20de1bd test.gguf:tensor_0 - Ok
xxh64 7d3a1f9ac04d0537 test.gguf:tensor_1 - Ok
sha1 4765f592eacf096df4628ba59476af94d767080a test.gguf:tensor_1 - Ok
sha256 8514cbcc73692a2c56bd7a33a022edd5ff819614bd23b19915d7224387f397a7 test.gguf:tensor_1 - Ok
xxh64 a0af5d700049693b test.gguf:tensor_2 - Ok
sha1 25cbfbad4513cc348e2c95ebdee69d6ff2fd8753 test.gguf:tensor_2 - Ok
sha256 947e6b36e20f2cc95e1d2ce1c1669d813d574657ac6b5ac5196158d454d35180 test.gguf:tensor_2 - Ok
xxh64 e83fddf559d7b6a6 test.gguf:tensor_3 - Ok
sha1 a9cba73e2d90f2ee3dae2548caa42bef3fe6a96c test.gguf:tensor_3 - Ok
sha256 423b044e016d8ac73c39f23f60bf01bedef5ecb03c0230accd824c91fe86f1a1 test.gguf:tensor_3 - Ok
xxh64 1257733306b7992d test.gguf:tensor_4 - Ok
sha1 d7bc61db93bb685ce9d598da89717c66729b7543 test.gguf:tensor_4 - Ok
sha256 79737cb3912d4201384cf7f16a1a37ff7823f23ea796cb205b6ca361ab9e3ebf test.gguf:tensor_4 - Ok
xxh64 d238d16ba4711e58 test.gguf:tensor_5 - Ok
sha1 0706566c198fe1072f37e0a5135b4b5f23654c52 test.gguf:tensor_5 - Ok
sha256 60949be8298eced0ecdde64487643d018407bd261691e061d9e9c3dbc9fd358b test.gguf:tensor_5 - Ok
xxh64 3fbc3b65ab8c7f39 test.gguf:tensor_6 - Ok
sha1 73922a0727226a409049f6fc3172a52219ca6f00 test.gguf:tensor_6 - Ok
sha256 574f4c46ff384a3b9a225eb955d2a871847a2e8b3fa59387a8252832e92ef7b0 test.gguf:tensor_6 - Ok
xxh64 c22021c29854f093 test.gguf:tensor_7 - Ok
sha1 efc39cece6a951188fc41e354c73bbfe6813d447 test.gguf:tensor_7 - Ok
sha256 4c0410cd3c500f078ae5b21e8dc9eb79e29112713b2ab58a882f82a3868d4d75 test.gguf:tensor_7 - Ok
xxh64 936df61f5d64261f test.gguf:tensor_8 - Ok
sha1 c2490296d789a4f34398a337fed8377d943d9f06 test.gguf:tensor_8 - Ok
sha256 c4401313feeba0261275c3b25bd2d8fe40ce04e0f440c2980ed0e9674c30ff01 test.gguf:tensor_8 - Ok
xxh64 93fd20c64421c081 test.gguf:tensor_9 - Ok
sha1 7047ce1e78437a6884337a3751c7ee0421918a65 test.gguf:tensor_9 - Ok
sha256 23d57cf0d7a6e90b0b3616b41300e0cd354781e812add854a5f95aa55f2bc514 test.gguf:tensor_9 - Ok
xxh64 5a54d3aad816f302 test.gguf - Ok
sha1 d15be52c4ff213e823cb6dd13af7ee2f978e7042 test.gguf - Ok
sha256 7dd641b32f59b60dbd4b5420c4b0f6321ccf48f58f6ae201a3dbc4a58a27c6e4 test.gguf - Ok
Verification results for test.gguf.manifest - Success |
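The "strongest hash wins" behaviour shown above can be sketched as a small manifest parser. Hedged: this is an illustrative reconstruction in Python, not the tool's actual C code, and the `STRENGTH` ordering is an assumption inferred from the output above:

```python
STRENGTH = {"xxh64": 0, "sha1": 1, "sha256": 2}  # weakest -> strongest

def strongest_entries(manifest_text: str) -> dict:
    """Parse '<hash-type> <digest> <name>' lines, keeping only the
    strongest hash type recorded for each entry name."""
    best = {}
    for line in manifest_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] not in STRENGTH:
            continue
        htype, digest, name = parts
        if name not in best or STRENGTH[htype] > STRENGTH[best[name][0]]:
            best[name] = (htype, digest)
    return best
```

With the manifest above, every `test.gguf:tensor_N` entry resolves to its sha256 line, which matches the default check output shown earlier.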
Feels a bit of an overkill to add so many hashes. I would have probably used something like th64 instead.
As long as the hashing logic does not become a core component of the library (i.e. hashes should not be needed for normal operation), it's OK for it to live in the examples, at least until we see a clear benefit in the future. If build issues arise with this code, they will be treated with low priority and potentially disabled until resolved.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Yeah, it's a bit overkill. It's because I wanted to add sha1 so I could add in uuid (and have a way to cross-check my logic)... then I realized it's also useful for CI/CD with xxhash... and then I might as well add in sha256. Basically scope creep. Well, at least it's now done... Thanks for the link; https://github.com/tidwall/th64 looks interesting, but it doesn't seem too different from other hashes according to the author of SMHasher3 in https://news.ycombinator.com/item?id=40877460.
At least xxhash seems to be much more widely used and battle-tested across multiple projects, and thus can be better relied upon (plus there is the
That I agree with. I wasn't too sure how useful it would be, hence placing it in examples. But if it ends up becoming a core part of CI/CD, then definitely consider relocating it. Yes, just disable it if it causes problems. It's not a super complicated program, so I don't see it breaking too much. Anyway, I've got your two suggestions in, so I will merge when all checks pass. Thanks. |
…gerganov#8048)

CLI to hash GGUF files to detect differences on a per-model and per-tensor level.

The hash types we support are:
- `--xxh64`: use xxhash 64-bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash checking programs like sha256sum, they are designed to check entire files. This is not ideal for our purpose if we want to check the tensor data for consistency even when the metadata content of the gguf KV store has been updated. This program is designed to hash a gguf tensor payload on a 'per tensor layer' basis in addition to an 'entire tensor model' hash. The intent is that the entire tensor model hash can be checked first, and if any inconsistency is detected, the per-tensor hashes can be used to narrow down which specific tensor layer is inconsistent.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This is a WIP PR proposal for per-layer hashing and model-wide hashing of a gguf model.
I previously did an experiment attempting to make the hashing process independent of quantisation, but that turned out to have too many technical issues, and the use case for such a feature is uncertain.
This PR, on the other hand, focuses only on hashing each tensor as an opaque data area without attempting to decode the content.
The intended application for this feature is as part of a CI flow: instead of storing test file outputs, you can just store the expected hash output and use it to check for regressions. For this reason I added xxhash, as it is much faster than sha1 at hashing, but left sha1 in because it is more widely supported (e.g. built into Python).
For the Python hash implementation I also added a UUIDv5 generator; I plan to add that to the C side if it makes sense.
One of my ideas is that every model would have a unique UUID based on the model content. I would be happy to hear feedback about this, as I plan to include it during model conversion processes. e.g.
Note that for the global model-wide hash, I just hash every tensor in the order it was dumped from the gguf file... so if the tensor order is swapped in the file, the hash will likely change.
(For this PR, I decided that the KV store hash is outside of scope)
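The order-dependence described above can be sketched as follows. Hedged: `hash_tensors` is a hypothetical helper for illustration; the PR itself does this in C over the GGUF tensor table:

```python
import hashlib

def hash_tensors(tensors):
    """tensors: iterable of (name, bytes) pairs.

    Returns per-tensor SHA-1 hex digests plus a model-wide digest that is
    fed each tensor's data in file order."""
    whole = hashlib.sha1()
    per_tensor = {}
    for name, data in tensors:
        per_tensor[name] = hashlib.sha1(data).hexdigest()
        whole.update(data)          # order matters for the model-wide hash
    return per_tensor, whole.hexdigest()
```

Swapping two tensors leaves every per-tensor digest unchanged but changes the model-wide digest, which is exactly the behaviour noted above.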
example of sha1 output of phi-2.Q6_K.gguf
example of xxhash output of phi-2.Q6_K.gguf
example of sha256 output of phi-2.Q6_K.gguf
Overall, the checksum for the whole tensor array in this test is:
Which matches cross-checking with the gguf-dump --data-offset feature that was recently added:
Note that this cross-check method will only hold true until we start appending non-tensor data in future gguf formats.
As for the UUID via tensor generation this is what I got: