DISCUSSION PROPOSAL Polly eventing and metrics architecture
Version: 0.8
Authors/Contributors: Dylan Reisenberger @reisenberger
Thanks also to: @lakario; @ankitbko; @seanfarrow; @tomkerkhove whose earlier ideas, comments and prototypes have already influenced this.
Date: 26 January 2018
Status: Proposals for discussion.
Comment via this GitHub issue or join our Slack channel for metrics updates and discussion.
Type/interface hierarchy? Or keep it relatively flat (perhaps single sealed type), extensible with dictionary-like semantics for data specific to individual policy/event types? Or combination?
Producers and consumers need a common way of specifying and identifying event types.
Either a separate NuGet package named (eg) Polly.EventTypes or Polly.Events.EventTypes, which both producers and consumers reference. Or keep these in the main Polly package.
If a type hierarchy is used, the .NET type itself distinguishes events for .NET consumers.
If events are ever to be consumed by a non-.NET platform, a string-constant/enumeration-like identifier could also be useful. A quasi-enumeration of policy event types could be: (example, not necessarily exact structure)
Polly.Retry.Events.OnRetry = "Retry.OnRetry"; // or just "OnRetry", if identified to "Retry" policy elsewhere
Polly.Retry.Events.OnRetrySuccess = "Retry.OnRetrySuccess";
Polly.CircuitBreaker.Events.OnOpen = "CircuitBreaker.OnOpen";
Polly.CircuitBreaker.Events.OnClose = "CircuitBreaker.OnClose";
Polly.Fallback.Events.OnFallbackInvoked = "Fallback.OnFallbackInvoked";
If so, probably string rather than pure enum. Users coding custom policies may want to add custom event types: string is open for extension while a pure enum is closed. public static string can still provide compile-time-bound matching, eg .Where(e => e.EventType == Polly.CircuitBreaker.Events.OnBreak) if use of a type hierarchy is not available.
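As a rough illustration of the string-based quasi-enumeration, the sketch below groups identifiers into static classes and shows a custom policy adding its own. All class names here (RetryEvents, CircuitBreakerEvents, MyThrottleEvents, EventTypeFilter) are hypothetical, and const is used where the text suggests public static string; neither is part of Polly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names throughout: a sketch of the quasi-enumeration, not Polly's API.
public static class RetryEvents
{
    public const string OnRetry = "Retry.OnRetry";
    public const string OnRetrySuccess = "Retry.OnRetrySuccess";
}

public static class CircuitBreakerEvents
{
    public const string OnOpen = "CircuitBreaker.OnOpen";
    public const string OnClose = "CircuitBreaker.OnClose";
}

// A custom policy author can add identifiers without touching the core set,
// which a closed enum would not allow.
public static class MyThrottleEvents
{
    public const string OnThrottled = "MyThrottle.OnThrottled";
}

public static class EventTypeFilter
{
    // Compile-time-bound matching against a string identifier.
    public static IEnumerable<string> OnlyCircuitOpens(IEnumerable<string> eventTypes) =>
        eventTypes.Where(e => e == CircuitBreakerEvents.OnOpen);
}
```

A consumer on another platform would match on the same string values, without needing the .NET types.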
A PolicyEvent would comprise (thoughts so far) three main kinds of data:
- 1 Metadata: common to all event types
- 2 Event data: Data specific to the given policy type and event type
- 3 User data: Custom data which the user could add to events
A property-value/key-value store of metadata common to all events:
- PolicyWrapKey: The key of the PolicyWrap (if applicable) executing
- PolicyKey: The key of the Policy generating the event
- ExecutionKey: (better renamed?: CallSiteKey): A key identifying the call site within the code generating the event. Potentially differs from PolicyKey, as a policy instance may be re-used in multiple call sites.
- ExecutionGuid: a Guid distinguishing this particular execution.
And:
- SourceTimestamp: UTC timestamp of the time in the source system at which the event was raised
See later discussion also on capturing call execution time, re:
- SourceTimerTicks: Tick count in the source system at which the event was raised
- SourceTicksPerSecond: resolution of ticks at source
And:
- possibly PolicyType: eg Retry; CircuitBreaker; Fallback.
- EventType: String constant indicating the type of event. Drawn from a quasi-enumeration.
More?
A property-value/key-value store of data specific to the policy type and/or event type. This might contain a mixture of configuration information and state. For instance, for retry policy events it might contain:
- MaxRetries: 3
- CurrentTry: 1
For advanced circuit-breaker, eg:
- FailureThreshold: 0.5
- (other similar configuration data)
- CircuitState: HalfOpen
- CircuitBrokenUntil: time
- (etc)
Any value in splitting this into two sections?: policy-constant elements (eg configuration) and varying elements (temporal state, or perhaps data particular to that one event)? (Is distinction sufficiently clear to always maintain, always add value?). Probably becomes clearer as we list out events per policy type.
Should configuration be (i) written to each event, and/or (ii) modelled as a separate stream? (Compare: https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring#configuration-stream). Perhaps initially (i), then later also (ii). Separate developments are underway within Polly around dynamic reconfiguration.
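To make the flat, dictionary-extensible option concrete, here is a minimal sketch of a single sealed event type carrying common metadata plus a key-value store of policy-specific data. The property names follow the lists above; the class shape, constructor and the PolicyEventExamples helper are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flat event type: property names follow the proposal's lists,
// but the shape itself is an assumption, not an agreed design.
public sealed class PolicyEvent
{
    public PolicyEvent(string policyKey, string eventType, Guid executionGuid,
        IReadOnlyDictionary<string, object> eventData)
    {
        PolicyKey = policyKey;
        EventType = eventType;
        ExecutionGuid = executionGuid;
        SourceTimestamp = DateTime.UtcNow;
        EventData = eventData;
    }

    // Metadata common to all event types.
    public string PolicyKey { get; }
    public string EventType { get; }
    public Guid ExecutionGuid { get; }
    public DateTime SourceTimestamp { get; }

    // Data specific to the policy type and event type.
    public IReadOnlyDictionary<string, object> EventData { get; }
}

public static class PolicyEventExamples
{
    // Builds a retry event that writes configuration (MaxRetries) and
    // state (CurrentTry) into the same store, ie option (i) above.
    public static PolicyEvent RetryEvent(string policyKey, int maxRetries, int currentTry) =>
        new PolicyEvent(policyKey, "Retry.OnRetry", Guid.NewGuid(),
            new Dictionary<string, object>
            {
                ["MaxRetries"] = maxRetries,
                ["CurrentTry"] = currentTry,
            });
}
```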
A key-value store of data the user may wish added to each event.
Users might have their own metadata they attach to executions via Polly's Context, which they may want to expose to eventing dashboards / downstream metrics ingestion. Examples:
- a CorrelationId tracking the progress of an original user request among downstream microservice interactions
- the application/component generating the event
- when horizontally-scaling, the instance/node which generated the event
Idea: Users could specify a Func<Context, IEnumerable<KeyValuePair<string, object>>> as a projection of user data from Context. If so, it certainly must be a selective Func like this rather than simply serializing the whole of Context. Context may include sensitive user data which it would be inappropriate to distribute; serializing the whole of Context is also likely to be wasteful.
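A selective projection of this kind might look like the sketch below. It uses IDictionary<string, object> as a stand-in for Polly's Context so that the example is self-contained; UserDataProjection and SelectKeys are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UserDataProjection
{
    // Assumption: IDictionary<string, object> stands in for Polly's Context.
    // Returns a projection that copies only the explicitly allowed keys,
    // so anything else in the context never reaches the event stream.
    public static Func<IDictionary<string, object>, IEnumerable<KeyValuePair<string, object>>>
        SelectKeys(params string[] allowedKeys) =>
        context => allowedKeys
            .Where(key => context.ContainsKey(key))
            .Select(key => new KeyValuePair<string, object>(key, context[key]))
            .ToList();
}
```

With an allow-list of CorrelationId and NodeName, a context that also holds (say) an auth token would contribute only the CorrelationId entry to the event.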
We should possibly consider compatibility/convertibility of the Polly event format to other formats we might want to interface with, such as Azure Event Grid, input to AppInsights, Hystrix dashboard, etc.
Anyone is welcome to put time into these comparisons and draw out anything we need to learn.
Layers could be:
- (1) Core Polly package
- (2a) Polly.Events.Rx
- (2b) Polly.Metrics.Rx
- (3) Polly.Metrics.Rx.AppInsights, Polly.Metrics.Rx.HystrixDashboard (etc)
The core Polly package should ideally not take a dependency on Rx.
We may be able to rely on the fact that System.IObservable<T> is in the core BCL, outside Rx.
Or it may be that core Polly policies should expose a traditional .NET event hook (or similar: see discussion on de-duplication) for raising initial events.
Testability may influence the choice.
Separation between (2a) and (2b) may initially look / be unnecessary, but see later discussion about shipping off box.
(2a) Polly.Events.Rx provides an implementation that ensures layer (1) can emit an Rx stream of events.
Perhaps Observable.FromEvent(...) pattern (or similar) if (1) is based around .NET events (or similar).
Or if (1) expresses signatures in System.IObservable<T>, it might be an Rx event-pump that could be injected into policies in (1) to turn on the events.
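The first option, where layer (1) raises plain .NET events and layer (2a) adapts them, can be sketched using only the BCL interfaces System.IObservable<T> and System.IObserver<T>. This is a hand-rolled stand-in for what Rx's Observable.FromEvent would provide, not Rx itself; PolicyEventSource, EventObservable and RecordingObserver are hypothetical types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical layer (1) source exposing a plain .NET event, with no Rx dependency.
public sealed class PolicyEventSource
{
    public event Action<string> EventRaised;
    public void Raise(string eventType) => EventRaised?.Invoke(eventType);
}

// Hypothetical layer (2a) adapter: bridges the event to IObservable<T>.
public sealed class EventObservable : IObservable<string>
{
    private readonly PolicyEventSource _source;
    public EventObservable(PolicyEventSource source) => _source = source;

    public IDisposable Subscribe(IObserver<string> observer)
    {
        Action<string> handler = observer.OnNext;
        _source.EventRaised += handler;
        // Disposing the subscription detaches the handler from the event.
        return new Unsubscriber(() => _source.EventRaised -= handler);
    }

    private sealed class Unsubscriber : IDisposable
    {
        private Action _dispose;
        public Unsubscriber(Action dispose) => _dispose = dispose;
        public void Dispose()
        {
            _dispose?.Invoke();
            _dispose = null; // make repeated Dispose calls harmless
        }
    }
}

// Minimal observer that records what it receives.
public sealed class RecordingObserver : IObserver<string>
{
    public List<string> Received { get; } = new List<string>();
    public void OnNext(string value) => Received.Add(value);
    public void OnError(Exception error) { }
    public void OnCompleted() { }
}
```

Because the source only exposes an event, the core package stays free of Rx, and each Subscribe call attaches exactly one handler that its returned IDisposable removes.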
(2b) Polly.Metrics.Rx would offer a range of Rx functions aggregating events to create a range of standard metrics: (below quick examples, would be many more)
- on retry policies, a ?rolling-window gauge? of the average number of retries needed to achieve success
- on cache policy, rolling cacheHit/cacheMiss ratio
- on policies and PolicyWraps, average call execution time
The main classes of info producible as information aggregated from source events are:
Informational:
- simple informational properties, eg the name of the PolicyKey
- configuration properties, eg how many retries are configured for this retry policy
Counts:
- pure integer. Count of how many times something has happened since metrics tracking started.
- Would this need to be accompanied by a timestamp of when counting began?
Timer:
- how long something took, eg call execution time.
Gauge:
Both count and timer measurements may then be averaged across a recent window of time to produce a rolling gauge. For example:
- Average number of retries needed for success on this call channel was 1.2, in the last 60 seconds
- Average execution time of calls through this channel was 4.3ms, in the last 60 seconds
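A rolling gauge of the kind described could be sketched as below. RollingAverageGauge is a hypothetical name; the sketch assumes samples arrive in timestamp order, takes the current time as a parameter for testability, and is not thread-safe. A real layer (2b) would more likely express this with Rx windowing operators.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rolling gauge: averages the samples recorded within the trailing window.
// Assumes samples are recorded in timestamp order; not thread-safe.
public sealed class RollingAverageGauge
{
    private readonly TimeSpan _window;
    private readonly Queue<(DateTime At, double Value)> _samples =
        new Queue<(DateTime At, double Value)>();

    public RollingAverageGauge(TimeSpan window) => _window = window;

    public void Record(DateTime atUtc, double value) => _samples.Enqueue((atUtc, value));

    // Returns null when no samples fall inside the window ending at nowUtc.
    public double? Average(DateTime nowUtc)
    {
        // Evict samples older than the window before averaging.
        while (_samples.Count > 0 && nowUtc - _samples.Peek().At > _window)
            _samples.Dequeue();

        return _samples.Count == 0 ? (double?)null : _samples.Average(s => s.Value);
    }
}
```

Feeding it the retry count of each successful execution would give the "average number of retries needed for success in the last 60 seconds" style of reading; feeding it durations would give the rolling execution time.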
Terminology
Similar implementations (eg Hystrix and StatsD) use these terms in differing manners: research, and clarify our usage? Ref: https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring#bucketed-event-counters (and following); http://statsd.readthedocs.io/en/v0.5.0/types.html
2.1.3 Metrics architecture: (3) Polly.Metrics.Rx.AppInsights, Polly.Metrics.Rx.HystrixDashboard (etc)
Packages such as Polly.Metrics.Rx.AppInsights would transform/render metrics for input to an individual dashboard. Consumers need not only be dashboards: they could also be (eg) logging systems, or alert-raising systems.
Suggest: these layers are as 'dumb' as possible: they should not contain any knowledge/processing which manipulates data from the layer below into some further useful aggregation or statistic. If a new, useful aggregation or statistic is conceived, the logic/function to create that aggregation should ideally be implemented in (2b), so that it may be useful to other dashboards; layer (3) should only manipulate metrics computed by (2b) into formats acceptable to the consumer (3) targets. Analogy: similar to (3) being a 'dumb' view/view-model in MVC or MVVM; 'thinking' happens in lower layers.
We need to consider when to ship the Rx work off the user's thread: http://www.introtorx.com/content/v1.0.10621.0/15_SchedulingAndThreading.html. At (2b) Polly.Metrics.Rx?
For high-throughput systems, it may become important to be able to ship events 'off box' (off the main production server, towards processing capacity dedicated just to handling events/metrics), and to offer options for when to do so. Other high volume stats/metrics implementations like StatsD and Hystrix consider this.
The suggested division of packages above pre-plans for a couple of options for when to ship:
(1)-(2a) -> ship raw events off app servers -> (2b)--?>-(3): Ship raw events (2a) off box before aggregating (2b). Increases network traffic but decreases metrics CPU load on the app servers.
or
(1)-(2a)-(2b) -> ship aggregated stats off app servers -> (3): Aggregate stats (2b) still on app servers, before shipping off box. Decreases network traffic, but increases metrics CPU load on app servers.
Perhaps not for first implementation/don't need an answer immediately, but dividing (2a) and (2b) as separate packages [or keeping in mind the ability to do so] would forward plan for this.
When shipping off box, consider also batching events.
Any policy type could emit two kinds of timing information (or the events necessary to calculate them):
- Elapsed execution time of the overall policy execution, including work done by the policy code
- Elapsed execution time of the user delegate execution, excluding work done by the policy code
These could be achieved by each policy instance emitting events:
PolicyExecutionStart
DelegateExecutionStart
DelegateExecutionEnd
PolicyExecutionEnd
The events detailed above, PolicyExecutionStart, DelegateExecutionStart (etc) - and indeed any other event - could include long Ticks properties, allowing duration calculations by subtraction.
Options for Ticks sources:
Stopwatch is traditionally recommended as more precise, and is also preferred as a monotonic clock over a time-of-day clock.
To avoid repeated extra allocations of Stopwatch instances, Polly could run a thread-safe singleton Stopwatch instance or use the static Stopwatch.GetTimestamp(). The size of Int64.MaxValue compared to Stopwatch.Frequency makes this viable for app lifecycles without overflow.
If so, the central Stopwatch instance could/should be abstracted and replaceable-by-property-injection, in the same manner core Polly already abstracts SystemClock, to support unit testing.
A Stopwatch obviously rebases to zero on each start and is not synchronized between processes: Ticks values from the stopwatch would not be comparable across different running processes using Polly, only good for subtraction between policy events from the same process. Layer (2b) metrics aggregation should expose only durations, and mask the source ticks, to prevent inadvertent downstream misuse of non-correlating ticks from multiple Polly processes.
Absolute DateTime.UtcNow at event source should likely still be emitted alongside stopwatch ticks in events, as informational for logging.
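The duration-by-subtraction idea, together with a replaceable tick source, might be sketched as follows. EventTicks and its members are hypothetical; only Stopwatch.GetTimestamp() and Stopwatch.Frequency are real BCL members.

```csharp
using System;
using System.Diagnostics;

// Hypothetical abstraction, loosely modelled on the way Polly abstracts SystemClock.
public static class EventTicks
{
    // Replaceable for unit tests; defaults to the allocation-free static Stopwatch API.
    public static Func<long> GetTimestamp = Stopwatch.GetTimestamp;

    // Tick resolution at source (SourceTicksPerSecond in the metadata list).
    public static long TicksPerSecond = Stopwatch.Frequency;

    // Duration between two tick readings taken in the same process.
    public static TimeSpan Elapsed(long startTicks, long endTicks) =>
        TimeSpan.FromSeconds((double)(endTicks - startTicks) / TicksPerSecond);
}
```

A policy would capture EventTicks.GetTimestamp() when raising PolicyExecutionStart and PolicyExecutionEnd, and layer (2b) would pass the pair to Elapsed, exposing only the resulting duration. On overflow: assuming a frequency of 10,000,000 ticks per second, Int64.MaxValue ticks corresponds to roughly 29,000 years, though the actual frequency varies by platform.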
A Polly PolicyWrap is intentionally free-form (policies can be wrapped in any combination) in a way that a Hystrix command is not.
This means that there will be no such thing as a one-size-fits-all Polly dashboard. However:
- For PolicyWrap: We can offer standard statistics which apply whatever the composition of the PolicyWrap, eg overall execution time.
  - It is relatively trivial also for PolicyWrap to identify when it is executing the innermost or outermost policy of the wrap, allowing related statistics.
- For individual Policy types: We can offer common metrics, and common dashboard visualizations of them:
  - eg for retry, average number of tries needed;
  - for circuit-breaker, percentage of time circuit open/closed in recent window, etc.
An additional possibility would be to develop a standard set of PolicyWrap metrics which work across a semi-standard PolicyWrap including (optionally) (up to 1 of) each main Policy type, in a specific sequence, likely: Fallback -> Cache -> Retry -> CircuitBreaker -> Bulkhead -> Timeout (this sequence corresponds to the PolicyWrap wiki). This would also offer a prototype which others could adapt from if they have bespoke PolicyWraps featuring policy types in different sequence or number.
We may want convenience methods/properties on PolicyWrap to:
- enable events for all individual policies in a PolicyWrap
- subscribe to a stream of events emitted by all policies in a PolicyWrap
We may need to consider how to do this without creating unintended duplicate subscriptions (at whatever layer) and possible duplicate events. Possible to tag events with Guids and de-duplicate, but better to architect not to allow. May not arise depending on architecture. Intended multiple subscription should be permitted.
A separate document/discussion will cover the events to be emitted - and thus the statistics which could be aggregated - for each policy type.