Skip to content

Galil-Seiferas algorithm: String search in constant space, linear time, for nonorderable alphabets

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

bluss/galil-seiferas

Repository files navigation

galil-seiferas

General string search in constant space, linear time, for nonorderable alphabets. Also known as exact string matching.

Please read the API documentation on docs.rs

build_status crates

Recent Changes

  • 0.1.5
    • Add gs_find_by with custom comparator support
    • Add struct Pattern for separate pattern preprocessing.
  • 0.1.4
    • Expression tweaks again, for the new nightly
  • 0.1.3
    • Update the algorithm to keep the position in the pattern during search passes. (Makes fewer comparisons with periodict patterns.)
    • Expression tweaks that improve benchmarks without affecting the algorithm
  • 0.1.2
    • Cleanup code, better code comments, refactor, more tests and no memcmp use
    • The library is now always no_std.
  • 0.1.1
    • Fix bug in decompose (#1)
  • 0.1.0
    • Initial release

Benchmarks

Here are some comparisons as a plain byte string searcher. This isn't the use case for this algorithm, its use case is for nonorderable alphabets. But we use this to have wide comparisons. Keep in mind that the byte string search is characterized by its very cheap comparison operation.

It's not fast — two way is a whole lot better, if you can use it. But we have one baseline, and that's the "naive" brute force search which is quadratic in the size of the input, in the worst case.

The repo https://github.com/bluss/scratchspace collects some string matching benchmarks, and a lot of them are “pathologies” or cases that were known to be bad for some algorithm. The result of one run are below. (Only subset available)

This run is just a quick one so it may be quite noisy. Unavoidably there are single tests whose results are off.

The G-S algorithm is gs_find; the KMP algorithm (in a simple implementation) is kmp_find; the substring searcher in Rust is just find:

test naive::bmh_find                     ... bench:         833 ns/iter (+/- 223) = 300 MB/s
test naive::brute_force                  ... bench:       1,907 ns/iter (+/- 472) = 131 MB/s
test naive::find                         ... bench:         530 ns/iter (+/- 15) = 471 MB/s
test naive::gs_find                      ... bench:         615 ns/iter (+/- 6) = 406 MB/s  // NOTE
test naive::kmp_find                     ... bench:         714 ns/iter (+/- 24)   350 MB/s
test naive::memmem                       ... bench:         185 ns/iter (+/- 4) = 1351 MB/s
test naive_longpat::bmh_find             ... bench:     325,798 ns/iter (+/- 5,490) = 306 MB/s
test naive_longpat::brute_force          ... bench:   2,161,608 ns/iter (+/- 120,669) = 46 MB/s
test naive_longpat::find                 ... bench:     191,133 ns/iter (+/- 7,939) = 523 MB/s
test naive_longpat::gs_find              ... bench:     260,667 ns/iter (+/- 9,659) = 383 MB/s // NOTE
test naive_longpat::memmem               ... bench:      55,119 ns/iter (+/- 3,846) = 1814 MB/s
test naive_rev::bmh_find                 ... bench:          36 ns/iter (+/- 1) = 6944 MB/s
test naive_rev::brute_force              ... bench:         281 ns/iter (+/- 204) = 889 MB/s
test naive_rev::find                     ... bench:         394 ns/iter (+/- 47) = 634 MB/s
test naive_rev::gs_find                  ... bench:         292 ns/iter (+/- 20) = 856 MB/s  // NOTE
test naive_rev::kmp_find                 ... bench:         514 ns/iter (+/- 17)   486 MB/s
test naive_rev::memmem                   ... bench:          11 ns/iter (+/- 0) = 22727 MB/s
test short_word1_long::bmh_find          ... bench:       1,486 ns/iter (+/- 53) = 1716 MB/s
test short_word1_long::brute_force       ... bench:       4,042 ns/iter (+/- 226) = 631 MB/s
test short_word1_long::find              ... bench:         729 ns/iter (+/- 28) = 3499 MB/s
test short_word1_long::gs_find           ... bench:       4,791 ns/iter (+/- 184) = 532 MB/s  // NOTE
test short_word1_long::kmp_find          ... bench:       6,515 ns/iter (+/- 272)  391 MB/s
test short_word1_long::memmem            ... bench:       1,670 ns/iter (+/- 374) = 1527 MB/s
test short_word2_long::bmh_find          ... bench:       2,623 ns/iter (+/- 131) = 972 MB/s
test short_word2_long::brute_force       ... bench:       3,865 ns/iter (+/- 212) = 660 MB/s
test short_word2_long::find              ... bench:       1,169 ns/iter (+/- 40) = 2182 MB/s
test short_word2_long::gs_find           ... bench:       4,245 ns/iter (+/- 196) = 600 MB/s  // NOTE
test short_word2_long::memmem            ... bench:       1,729 ns/iter (+/- 150) = 1475 MB/s

License

Dual-licensed to be compatible with the Rust project.

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.

About

Galil-Seiferas algorithm: String search in constant space, linear time, for nonorderable alphabets

Topics

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Packages

No packages published