Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce more efficient implementations of str and join #545

Merged
merged 1 commit into from
Sep 26, 2024

Conversation

alexander-yakushev
Copy link
Contributor

Given that most of the purpose of HoneySQL is to mash strings together so that the user doesn't have to, it's natural that string mashing is responsible for most of its running time and generated allocations. Clojure's base library functions are not the most efficient for this kind of work:

  • clojure.core/str performs badly when provided more than 1 argument. It resorts to varargs and inefficient sequence iteration.
  • clojure.string/join also performs inefficient seq-based iteration of the provided collection.

This PR proposes adding a new namespace that contains optimized implementations of those two functions. As a bonus, the new join also has a transducer-accepting arity. Since it is a common pattern to string/join something after some action is performed on every element, the transducer arity becomes useful in many situations.

The new functions only have optimized bodies for :clj language selector and delegate to base functions everywhere else.

In the honey.sql namespace, clojure.core/str is replaced with honey.sql.util/str in the namespace form. join is referred directly, so everywhere in the code str/join is replaced with join and also modified to use transducer arity where necessary.

Obviously, I'm open to restructure the solution or dial back some changes if this is too much in one go.

@igrishaev
Copy link

Could you please share benchmarks (if you still have your repl open, by any chance)? Just wonder how much performance it gives.

@alexander-yakushev
Copy link
Contributor Author

I was mostly measuring the allocations improvements across multiple Metabase endpoints. This PR focuses mostly on allocations, the execution speed is not impacted as much. Here's one of the concrete queries:

(crit/quick-bench
   (format-dsl {:select [[[:min :recent_views.user_id] :user_id]
                         :model
                         :model_id
                         [[:max [:coalesce :d.view_count :t.view_count]] :cnt]
                         [:%max.timestamp :max_ts]]
                :group-by  [:model :model_id]
                :where     [:and
                            [:= :context "view"]
                            [:in :model #{"dashboard" "table"}]]
                :order-by  [[:max_ts :desc] [:model :desc]]
                :limit     10
                :left-join [[:report_dashboard :d]
                            [:and
                             [:= :model "dashboard"]
                             [:= :d.id :model_id]]
                            [:metabase_table :t]
                            [:and
                             [:= :model "table"]
                             [:= :t.id :model_id]]]}))

;; Before:

Evaluation count : 13122 in 6 samples of 2187 calls.
             Execution time mean : 50.880818 µs
    Execution time std-deviation : 11.514117 µs
   Execution time lower quantile : 44.925380 µs ( 2.5%)
   Execution time upper quantile : 70.466794 µs (97.5%)
                   Overhead used : 1.879403 ns

Allocations per call: 118,560b

---

;; After:

Evaluation count : 13188 in 6 samples of 2198 calls.
             Execution time mean : 45.462514 µs
    Execution time std-deviation : 53.006990 ns
   Execution time lower quantile : 45.418067 µs ( 2.5%)
   Execution time upper quantile : 45.542755 µs (97.5%)
                   Overhead used : 1.879403 ns

Allocations per call: 105,072b

I've made further improvements locally, but I don't want to dump everything into one PR to accommodate for easier review.

@p-himik
Copy link
Sponsor Contributor

p-himik commented Sep 26, 2024

A couple of counterpoints:

  • When the query is not dynamic or has few possible values, HoneySQL can be set to cache the formatter's result
  • Perhaps a better approach would be to use a string builder throughout

@alexander-yakushev
Copy link
Contributor Author

alexander-yakushev commented Sep 26, 2024

  1. I'm not sure whether such cache should be implemented on HoneySQL side. I think it's up to the consumer to know which queries are cacheable/AOT-computable and act accordingly.
  2. I agree that carrying around a stringbuilder would be even more efficient; however, I aimed at having minimal impact with this PR. Single StringBuilder approach will require much more rewriting. It can be done; perhaps, as another change in the future (if the improvement justifies the rewrite). A StringBuilder approach would also be harder to make CLJS-compatible in a clean way.

@seancorfield
Copy link
Owner

I'll note that the caching is built into HoneySQL -- but it is opt-in and you have to add the clojure.core.cache dependency and choose, on a per-call basis, whether to use caching. You need to use named parameters to really benefit from that (and there are a couple of SQL constructs you currently have to avoid, e.g., :in due to how they are currently expanded).

I like the idea of speeding up str and join but have two comments:

  1. It almost seems like this should be a separate library (that HoneySQL could then depend on)
  2. I feel strongly that the transducer arity of join should be (join sep xform coll) -- the parallel is (into to xform from) IMO and it makes the transformation from (join sep (map f coll)) to (join sep (map f) coll) more natural

@alexander-yakushev
Copy link
Contributor Author

I understand what you mean, but I really really dislike "util" deps in other libraries, and I don't think it's a good idea to multiplicate those. Maybe, at some point, when the list of "improved core functions" is big enough, it would make sense to move those into a separate library and depend upon. Until then, there is a a big value in HoneySQL having no dependencies, and keeping it this way is valuable. Adding two utility functions is a small price to pay for that.

I agree. I didn't know which to pick between the two; glad you have a strong stance on it.

@alexander-yakushev
Copy link
Contributor Author

Updated WRT to comment 2. by @seancorfield. Also added more str arities.

@seancorfield
Copy link
Owner

Thanks. That makes the diff a lot more readable as well as having a more intuitive arg order.

@seancorfield seancorfield merged commit 8c93e28 into seancorfield:develop Sep 26, 2024
4 checks passed
@seancorfield
Copy link
Owner

Can you explain why this line is present? clj-kondo flags it as an error (unresolved symbol):

  {:tag String}

seancorfield added a commit that referenced this pull request Sep 26, 2024
Signed-off-by: Sean Corfield <sean@corfield.org>
@alexander-yakushev
Copy link
Contributor Author

I copied it from clojure.core/str and forgot to remove. I don't think it does anything useful given that individual arity return types are already type-tagged. I will remove this in the next PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants