[Feature]: New higher-performance Filter() operators #758

Open
JakenVeina opened this issue Nov 18, 2023 · 17 comments · May be fixed by #941
@JakenVeina
Collaborator

JakenVeina commented Nov 18, 2023

Describe the functionality desired 🐞

Consider the following new filtering operators, which should provide performance improvements over existing operators:

public static IObservable<IChangeSet<TObject, TKey>> Filter<TObject, TKey, TState>(
    this    IObservable<IChangeSet<TObject, TKey>>  source,
            IObservable<TState>                     state,
            Func<TState, TObject, bool>             predicate,
            bool                                    suppressEmptyChangeSets = true);

public static IObservable<IChangeSet<TItem>> Filter<TItem, TState>(
    this    IObservable<IChangeSet<TItem>>  source,
            IObservable<TState>             state,
            Func<TState, TItem, bool>       predicate,
            bool                            suppressEmptyChangeSets = true);

public static IObservable<IChangeSet<TObject, TKey>> FilterValues<TObject, TKey>(
    this    IObservable<IChangeSet<TObject, TKey>>  source,
            Func<TObject, bool>                     predicate,
            bool                                    suppressEmptyChangeSets = true);

I'm not set on the name for the third one, but it stems from a conversation with @RolandPheasant and @dwcullop in Slack. Alternatives would maybe be .FilterSlim(), .FilterDeterministic(), or .FilterWithoutCaching(). Whatever the name, the goal would be to keep it consistent with other operators (like .Transform()) that normally keep an internal items cache, but could have versions that don't.

The steps the functionality will provide

For the first two operators, utilizing an IObservable<TState> parameter, the primary improvement is to reduce the memory allocations required by existing operators that support dynamic filtering.

The third operator would improve performance for static-filtering scenarios, where the filter predicate is relatively fast and is known to be deterministic, i.e. collection items are immutable, at least from the perspective of filtering. Under this scenario, .Filter() can be implemented without the need for an internal cache of items, becoming allocation-free for all single-change ChangeSets.

This operator would not be applicable as an ObservableList operator, as deterministic filtering does not eliminate the need for an internal items cache there. An internal cache is still needed to track item indexes, so that downstream indexes can be adjusted for items filtered out of the source.

Considerations

  • Consumers can currently use .Filter() with the IObservable<Func<TObject, bool>> or IObservable<Func<TObject, TKey, bool>> parameters to dynamically control filtering, but this requires either a new Func<> allocation for each change in the filtering rules, or a fair amount of tedious code to reuse and re-emit a shared Func<> instance each time while mutating its closure (see the sketch after this list).
  • Consumers can currently use .FilterOnObservable(), but this requires multiple object allocations per added item, to create an IObservable<bool> that controls filtering of that item and to subscribe to that stream. This approach also doesn't really support immutable collection items at all, as replace operations force a teardown and reconstruction of the filtering stream for that item.
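
To illustrate the allocation difference described in the first consideration, here is a minimal sketch. The Person type and the minAge subject are illustrative assumptions, and the second call uses the state-based overload proposed by this issue, which does not exist in the library yet:

using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;
using DynamicData;

record Person(int Id, string Name, int Age);

static class FilterAllocationSketch
{
    public static void Demo()
    {
        var cache = new SourceCache<Person, int>(p => p.Id);
        var minAge = new BehaviorSubject<int>(18);

        // Today: every change of the minimum age allocates a fresh Func<Person, bool> (plus its closure).
        var existing = cache.Connect()
            .Filter(minAge.Select(age => new Func<Person, bool>(p => p.Age >= age)))
            .Subscribe();

        // Proposed by this issue (not yet in the library): a single static delegate,
        // with the changing value flowing in through the state observable.
        var proposed = cache.Connect()
            .Filter(minAge, static (age, p) => p.Age >= age)
            .Subscribe();
    }
}
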
@RolandPheasant
Collaborator

I think the filter-with-state option is inspired, and now that you have suggested it, I am surprised that no one, myself included, has ever thought to introduce it. I think it's a no-brainer that we should go with it 😃

Regarding the stateless filtering, I see a pitfall:

  1. An item is added and matches the filter.
  2. An update arrives which no longer matches the filter.
  3. This means that the original item should be removed.

We could not deterministically decide that step 3 should be a remove without knowing that the item was originally a match. Alternatively, we could always fire a remove, but that would be a load of misleading notifications. The same would be true when we receive an update after an add. I think most operators could handle this, but I think it would take a lot of looking to be sure.

@JakenVeina
Collaborator Author

JakenVeina commented Nov 20, 2023

We could not deterministically decide that step 3 should be a remove without knowing that the item was originally a match.

Sure we can: we re-run the predicate against .Previous, because one of the assumptions consumers are expected to fulfill is that they don't mutate items. That's where the other assumption, that "the filter predicate runs relatively fast", comes from. Or is .Previous not always populated for replacement changes?

@RolandPheasant
Collaborator

Of course. You're right. The third operator must assume that the object does not mutate and that the filter is a pure function.
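
To make the agreed approach concrete, here is a minimal sketch of how a stateless .FilterValues() might map a single cache change set without an internal item cache, assuming immutable items and a pure predicate. The names and structure are illustrative assumptions, not the actual implementation:

using System;
using DynamicData;

static class FilterValuesSketch
{
    // Maps one upstream change set to the filtered downstream change set,
    // re-running the predicate against .Previous for Update changes instead of
    // consulting an internal cache.
    public static ChangeSet<TObject, TKey> FilterChanges<TObject, TKey>(
            IChangeSet<TObject, TKey> changes,
            Func<TObject, bool> predicate)
        where TObject : notnull
        where TKey : notnull
    {
        var result = new ChangeSet<TObject, TKey>();

        foreach (var change in changes)
        {
            switch (change.Reason)
            {
                // Adds, Removes, and Refreshes pass through only when the item matches.
                case ChangeReason.Add when predicate(change.Current):
                case ChangeReason.Remove when predicate(change.Current):
                case ChangeReason.Refresh when predicate(change.Current):
                    result.Add(change);
                    break;

                case ChangeReason.Update:
                    var wasIncluded = predicate(change.Previous.Value);
                    var isIncluded = predicate(change.Current);

                    if (wasIncluded && isIncluded)
                        result.Add(change);
                    else if (wasIncluded)
                        result.Add(new Change<TObject, TKey>(ChangeReason.Remove, change.Key, change.Previous.Value));
                    else if (isIncluded)
                        result.Add(new Change<TObject, TKey>(ChangeReason.Add, change.Key, change.Current));
                    break;
            }
        }

        return result;
    }
}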

@dwcullop
Member

For clarity, I want to make sure I understand the use-cases, which might look something like this?

class Person
{
    public int Id { get; init; }
    public string Name { get; set; }
    public int Age { get; set; }
}

var cache = new SourceCache<Person, int>(p => p.Id);
var ageObservable = viewModel.WhenAnyValue(vm => vm.MinAgeFilter);  // Connected to some UI, for example

var filtered = cache.Connect()
    .Filter(ageObservable, static (age, p) => p.Age >= age)
    .Bind(out this.filteredByAgeCollection)
    .Subscribe();

And that would yield filteredByAgeCollection containing the subset of people who meet the minimum age?
And this is an improvement because the current options require the ageObservable to be transformed into the actual predicate, which could require extra closures and allocations, etc.?

Am I understanding everything correctly? If so, I like it.

@RolandPheasant
Collaborator

That looks about right

@dwcullop
Member

We could not deterministically decide that step 3 should be a remove without knowing that the item was originally a match. Alternatively, we could always fire a remove, but that would be a load of misleading notifications. The same would be true when we receive an update after an add. I think most operators could handle this, but I think it would take a lot of looking to be sure.

I think we should avoid emitting "misleading notifications" (the first event observed for a given Key should always be Add, the last should always be Remove, there shouldn't be another Add unless there has been a Remove, and Update and Refresh should appear between the two, etc.). We could even modify the Aggregator to validate via testing that operators always have this behavior.

Operators should also be written to work around or ignore misleading notifications as much as possible. These two approaches together should avoid bugs. But consumers would likely be surprised to receive a Remove for something for which there was never an Add. I know I would be.

We could not deterministically decide that step 3 should be a remove without knowing that the item was originally a match.

Sure we can: we re-run the predicate against .Previous, because one of the assumptions consumers are expected to fulfill is that they don't mutate items. That's where the other assumption, that "the filter predicate runs relatively fast", comes from. Or is .Previous not always populated for replacement changes?

I understand how applying the predicate to the Previous will enable emitting the correct Add/Remove in response to an Update. I hadn't considered that and I think that's a great idea. However, what about the following case?

  1. Add Event fires
  2. Predicate is applied with the latest from the State observable and returns false
  3. Add Event is omitted from the downstream events
  4. State observable fires with an updated value that would make the Predicate return true
  5. ????

How do we find the item that was just dropped so that an Add event can now be emitted, unless there's a local cache of all of the items? You have the same issue with any State observable change. I feel like you need the local cache of all items so that the filter can be re-applied... Or am I missing something?

@JakenVeina
Collaborator Author

State observable fires with an updated value that would make the Predicate return true.

The approach of applying the predicate to .Previous instead of keeping a cache of all items only works when the predicate is stateless/static, for specifically this reason. I.e. of the proposed new extensions, only .FilterValues() avoids keeping an internal cache, because it specifically doesn't support a changing predicate or predicate state.

@dwcullop
Member

The approach of applying the predicate to .Previous instead of keeping a cache of all items only works when the predicate is stateless/static, for specifically this reason. I.e. of the proposed new extensions, only .FilterValues() avoids keeping an internal cache, because it specifically doesn't support a changing predicate or predicate state.

But it isn't stateless. The other observable is even named IObservable<TState> state. Even if the filter is deterministic with regard to the Cache's value, when a new value comes in from the state observable, doesn't that mean all of the values have to be re-evaluated? At least, that's what it seems like to me because the predicate is Func<TState, TObject, bool> predicate. Am I missing something?

@JakenVeina
Collaborator Author

Am I missing something?

Yes, you're missing the fact that that overload won't be using the "just apply the predicate to .Previous instead of keeping a cache of all items" approach. The only one that will is the overload without a state parameter: .FilterValues().
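
For completeness, here is a rough sketch of what the state-driven overloads would do instead: keep the full upstream set so the predicate can be re-applied and diffed whenever the state fires. This is purely illustrative; the names and structure are assumptions, not the actual implementation:

using System;
using System.Collections.Generic;
using DynamicData;

sealed class FilterWithStateSketch<TObject, TKey, TState>
    where TObject : notnull
    where TKey : notnull
{
    // Every upstream item, plus which keys are currently included downstream.
    // (Population of _allItems from source change sets is omitted for brevity.)
    private readonly Dictionary<TKey, TObject> _allItems = new();
    private readonly HashSet<TKey> _includedKeys = new();
    private readonly Func<TState, TObject, bool> _predicate;

    public FilterWithStateSketch(Func<TState, TObject, bool> predicate)
        => _predicate = predicate;

    // Invoked when the IObservable<TState> produces a new value: re-evaluate every
    // cached item and emit only the Adds/Removes whose inclusion actually changed.
    public ChangeSet<TObject, TKey> OnStateChanged(TState newState)
    {
        var changes = new ChangeSet<TObject, TKey>();

        foreach (var (key, item) in _allItems)
        {
            var wasIncluded = _includedKeys.Contains(key);
            var isIncluded = _predicate(newState, item);

            if (isIncluded && !wasIncluded)
            {
                _includedKeys.Add(key);
                changes.Add(new Change<TObject, TKey>(ChangeReason.Add, key, item));
            }
            else if (!isIncluded && wasIncluded)
            {
                _includedKeys.Remove(key);
                changes.Add(new Change<TObject, TKey>(ChangeReason.Remove, key, item));
            }
        }

        return changes;
    }
}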

@dwcullop
Member

Got it, thanks!

@JakenVeina
Collaborator Author

Still need to implement the two TState variations.

@dwcullop
Member

dwcullop commented Mar 3, 2024

Still need to implement the two TState variations.

What does that look like?

@JakenVeina
Collaborator Author

Exactly what they look like in the original issue description.

@dwcullop
Member

dwcullop commented Mar 9, 2024

Exactly what they look like in the original issue description.

Are you going to implement those? I might be able to do it.

@JakenVeina
Collaborator Author

I've got an implementation in progress, yeah, no worries. The cache version is done, and I was hoping for a little feedback from @RolandPheasant in Slack, with regard to the list version.

@JakenVeina
Collaborator Author

JakenVeina commented Apr 8, 2024

Got a fun result here.

| Method                           | FilterPolicy    | Mean      | Error    | StdDev   | Ratio | RatioSD | Gen0     | Gen1     | Allocated | Alloc Ratio |
|--------------------------------- |---------------- |----------:|---------:|---------:|------:|--------:|---------:|---------:|----------:|------------:|
| RandomizedEditsAndStateChanges_1 | ClearAndReplace |  39.27 ms | 0.770 ms | 0.917 ms |  1.00 |    0.00 | 923.0769 |  76.9231 |   3.83 MB |        1.00 |
| RandomizedEditsAndStateChanges_2 | ClearAndReplace |  25.55 ms | 0.509 ms | 0.865 ms |  0.66 |    0.03 | 468.7500 |  62.5000 |   1.88 MB |        0.49 |
| RandomizedEditsAndStateChanges_3 | ClearAndReplace |  39.71 ms | 0.783 ms | 2.063 ms |  1.03 |    0.04 | 500.0000 | 142.8571 |   2.01 MB |        0.52 |
|                                  |                 |           |          |          |       |         |          |          |           |             |
| RandomizedEditsAndStateChanges_1 | CalculateDiff   | 122.78 ms | 2.274 ms | 3.261 ms |  1.00 |    0.00 | 750.0000 |        - |   3.87 MB |        1.00 |
| RandomizedEditsAndStateChanges_2 | CalculateDiff   |  72.88 ms | 1.357 ms | 2.481 ms |  0.60 |    0.02 | 857.1429 | 142.8571 |   3.56 MB |        0.92 |
| RandomizedEditsAndStateChanges_3 | CalculateDiff   |  40.89 ms | 0.803 ms | 1.604 ms |  0.33 |    0.01 | 909.0909 |  90.9091 |   3.63 MB |        0.94 |

I worked out 3 different implementations for the List version of this operator, and the one that seems to be the clear winner for performance (Version 3) is actually a FUNCTIONAL improvement as well: it preserves item ordering through the filter. E.g. if the unfiltered items are [A, B, C, D, E] and B and D get filtered out, the downstream is guaranteed to be [A, C, E].

Version 1 is the closest implementation to the current List .Filter() operators: it keeps 2 lists internally, an unfiltered one and a filtered one. Items are always added to the end of the filtered list, so item ordering is deterministic, but not preserved relative to the unfiltered list. When removals need to be processed, the filtered list is linearly searched to find the item.

Version 2 attempts to improve on Version 1 by keeping only one list internally, with each item stored alongside its filtered index, if any. Removes can now be processed without a linear search for the item, but they instead require a full iteration of the list to update the "filtered index" values in accordance with the removal.

Version 3 is similar to Version 2, with additional work put in for Add changes to preserve order, which then allows a variety of small optimizations when processing other change types, based on the assumption that "filtered index" values are in order.
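
As a rough illustration of the data structure described for Versions 2 and 3, here is a minimal sketch of the per-item record and of how a removal might be processed. The names are assumptions rather than the actual code, and emitting the corresponding downstream Remove change is omitted:

using System.Collections.Generic;

static class ListFilterSketch
{
    // Each unfiltered item carries its index within the filtered output,
    // or null if the item is currently filtered out.
    private readonly record struct ItemState<T>(T Item, int? FilteredIndex);

    // A removal needs no linear search to locate the item, but (as described for Version 2)
    // every included item after it must have its filtered index shifted down by one.
    private static void HandleRemove<T>(List<ItemState<T>> items, int unfilteredIndex)
    {
        var removed = items[unfilteredIndex];
        items.RemoveAt(unfilteredIndex);

        if (removed.FilteredIndex is not int removedFilteredIndex)
            return; // the removed item was filtered out, so downstream indexes are unaffected

        for (var i = 0; i < items.Count; i++)
        {
            if (items[i].FilteredIndex is int filteredIndex && filteredIndex > removedFilteredIndex)
                items[i] = items[i] with { FilteredIndex = filteredIndex - 1 };
        }
    }
}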

For extra fun, if we go with Version 3, I think we could theoretically re-implement all the existing List .Filter() operators to use this one under the hood.

@JakenVeina
Collaborator Author

JakenVeina commented Apr 8, 2024

Regarding the Cache version of the operator...

| Method                           | Mean     | Error   | StdDev   | Ratio | RatioSD | Gen0      | Allocated | Alloc Ratio |
|--------------------------------- |---------:|--------:|---------:|------:|--------:|----------:|----------:|------------:|
| RandomizedEditsAndStateChanges_1 | 324.9 ms | 6.37 ms |  5.96 ms |  1.00 |    0.00 | 2000.0000 |   8.57 MB |        1.00 |
| RandomizedEditsAndStateChanges_2 | 325.6 ms | 6.38 ms |  6.55 ms |  1.00 |    0.02 | 2000.0000 |   8.58 MB |        1.00 |
| RandomizedEditsAndStateChanges_3 | 356.8 ms | 6.94 ms | 10.17 ms |  1.10 |    0.03 | 2000.0000 |   8.57 MB |        1.00 |
| RandomizedEditsAndStateChanges_4 | 327.3 ms | 6.48 ms |  8.43 ms |  1.01 |    0.03 | 2000.0000 |   8.57 MB |        1.00 |
| RandomizedEditsAndStateChanges_5 | 325.5 ms | 6.20 ms |  7.84 ms |  1.00 |    0.03 | 2000.0000 |   8.57 MB |        1.00 |

All the slightly-different implementations I've tried have essentially the same performance.

The different things I've tried include:

  • Using only native operators (Version 2)
  • Using dual-locking, like the List version of the operator (upstream and downstream are locked separately, but with overlap, to preserve ordering of changesets; see the sketch after this list). (Versions 3 and 5)
  • Having all the code on a dedicated Subscription class, to promote readability, as there's quite a lot of code. (Versions 4 and 5)
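
Here is a rough sketch of the dual-locking idea referenced in the list above, as I understand the description; the hand-over-hand structure and all names are assumptions, not the actual implementation:

using System;
using System.Threading;
using DynamicData;

sealed class DualLockSketch<TObject, TKey>
    where TObject : notnull
    where TKey : notnull
{
    private readonly object _upstreamGate = new();
    private readonly object _downstreamGate = new();
    private readonly IObserver<IChangeSet<TObject, TKey>> _downstream;

    public DualLockSketch(IObserver<IChangeSet<TObject, TKey>> downstream)
        => _downstream = downstream;

    public void OnSourceChanges(IChangeSet<TObject, TKey> upstreamChanges)
    {
        IChangeSet<TObject, TKey> downstreamChanges;

        Monitor.Enter(_upstreamGate);
        try
        {
            downstreamChanges = Process(upstreamChanges);

            // The "overlap": the downstream gate is acquired before the upstream gate is
            // released, so two change sets processed back-to-back cannot swap order on
            // the way out, even though emission happens outside the upstream lock.
            Monitor.Enter(_downstreamGate);
        }
        finally
        {
            Monitor.Exit(_upstreamGate);
        }

        try
        {
            _downstream.OnNext(downstreamChanges);
        }
        finally
        {
            Monitor.Exit(_downstreamGate);
        }
    }

    private static IChangeSet<TObject, TKey> Process(IChangeSet<TObject, TKey> changes)
        => changes; // placeholder for the actual filtering work
}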

I'm planning on going with Version 5.

@JakenVeina JakenVeina linked a pull request Sep 8, 2024 that will close this issue