The DataBuilder framework is a high-level logic execution engine for running multi-step workflows. It currently powers the checkout system, as well as diagnostics and other workflows, at Flipkart. Consider this framework for the following scenarios:
- Multi-step workflow executions where each step depends on data generated by previous steps
- Executions that span a single request scope or multiple requests
- Systems built from reusable components that can be combined in different ways to produce different end results
A few examples of the above would be:
- Checkout-like systems where users can provide all or only part of the data across multiple steps. Depending on what has been provided, an order might complete or control might return to the user. As the user fills in more details, the system moves closer to the target, finally generating the order once all details have been provided.
- API gateways that combine data from multiple sources to generate a final response
The following are the salient features:
- Annotation based meta-data handling for builders
- Data flow analyzer and builder to generate execution graphs on the fly depending on the supplied meta-data and targets
- Support for loops and transient data that make sense in the context of one execution scope
- Single and multi-threaded data flow executors
- Extremely low overhead (about 20 µs for the single-threaded executor)
- Exposes low-level as well as high-level APIs for dynamically registering builders and targets and building flows
- Large number of test-cases that cover every aspect of the framework for reference
Data Builder Framework is conceptually inspired by build systems in general and Makefiles in particular.
### Terms
Before we get into the nitty-gritty details, let's go over the basic terminology:
- Data - The basic container for information generated by an actor in the system. Meta associated:
  - Name - Name of the data
- DataBuilder - An actor that consumes a set of Data and produces another Data. It has the following meta associated with it:
  - Name - Name of the builder
  - Consumes - A set of Data that the builder consumes
  - Produces - Data that the builder produces
- DataFlow - A specification and container for a topology of connected builders that generate a final data. It has the following meta:
  - Name - Name of the data flow
  - Target Data - The name of the data being generated by this data flow
  - Resolution Specs - If multiple builders known to the system can produce the same data, this can be used to specify an override for which particular builder generates a particular data in the context of this data flow.
  - Transients - A set of names of Data that are considered transient for this flow. (See later for a detailed explanation of transients)
- ExecutionGraph - A graph of connected and topologically sorted builders that are used by the execution engine to execute a flow.
- DataSet - A set of the Data provided by the client and generated internally by the different builders of a particular ExecutionGraph
- DataFlowInstance - An instantiation of a DataFlow that contains its own copy of the ExecutionGraph and DataSet. This represents the execution context of a particular request.
- DataDelta - The set of new data that needs to be considered as input for a particular execution
- DataFlowExecutor - The core engine that uses the provided DataDelta to execute the ExecutionGraph present in the given DataFlowInstance. It augments the DataSet within the DataFlowInstance with the non-transient Data generated by the different DataBuilders. All data newly generated by the engine (including transient data) is returned by the engine.
- DataSetAccessor - A typesafe utility for accessing data present in a DataSet. This is used inside builders to generate data.
The library can be used directly from Maven, or built locally from source.
### Build instructions
- Clone the source:

```
git clone https://github.com/flipkart-incubator/databuilderframework.git
```

- Build:

```
mvn install
```
Use the following repository:

```xml
<repository>
    <id>clojars</id>
    <name>Clojars repository</name>
    <url>https://clojars.org/repo</url>
</repository>
```
Use the following Maven dependency:

```xml
<dependency>
    <groupId>com.flipkart.databuilderframework</groupId>
    <artifactId>databuilderframework</artifactId>
    <version>0.5.11</version>
</dependency>
```
The framework can be used in two modes:
- Flow within a request scope
- Flow across multiple requests
We will go over both modes one by one. However, the basic theory remains the same, so let's take a look at the basics first.
The basic flow is the following (a concrete sketch follows the list):
- Identify and create Data classes
- Create builders
- Register builder meta-data (consumes, produces)
- Build (and optionally save) data flows, specifying a target data
- Start accepting requests:
  - Create a data-flow instance
  - Accept input data in a data-delta
  - Execute the data-flow with the delta
  - One of two things can happen at this time:
    - All required data was given or generated, and the target data was produced
    - All required data was not present, and the flow did not complete
  - In both of the above cases, a map of all the data generated during the current flow execution is returned
  - All non-transient data is added to the dataset present in the data-flow instance
  - More data can be provided in subsequent invocations of execute to complete the flow
- The executor uses data present in the data-delta and the data-set to generate more data, reaching the terminal state once it has generated the specified target data
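To make this concrete, here is a minimal sketch of the lifecycle above, using the framework calls demonstrated later in this document. UserDetails, Cart, OrderResponse, and OrderBuilder are hypothetical application classes, not part of the framework:

```java
import java.util.concurrent.Executors;
//framework and application imports elided; see the sections below

//Build (and cache) a flow for a target data; OrderBuilder is a hypothetical
//builder annotated to consume UserDetails and Cart and produce OrderResponse
final DataFlow orderFlow = new DataFlowBuilder()
        .withDataBuilder(new OrderBuilder())
        .withTargetData(OrderResponse.class)
        .build();
final DataFlowExecutor executor
        = new MultiThreadedDataFlowExecutor(Executors.newFixedThreadPool(10));

//Per user session: create an instance and execute with incoming data-deltas
DataFlowInstance instance = new DataFlowInstance("order-123", orderFlow.deepCopy());
executor.run(instance, new UserDetails());   //incomplete: target not yet produced
DataFlowExecutionResponse response = executor.run(instance, new Cart());
OrderResponse order = response.get(OrderResponse.class); //non-null once the flow completes
```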
You need to extend the Data class if you want to explicitly name your data:
```java
public class TestDataA extends Data {
    //Members
    public TestDataA(...) {
        super("A");
        //...
    }
    //Accessors etc
}
```
If the class name is sufficient as the name of your data, you can use the generic DataAdapter class:
```java
public class TestDataA extends DataAdapter<TestDataA> {
    //Members
    public TestDataA(...) {
        super(TestDataA.class);
        //...
    }
    //Accessors etc
}
```
DataBuilders have to extend the DataBuilder class and override the process() function:
```java
public class TestBuilderA extends DataBuilder {
    @Override
    public Data process(DataBuilderContext context) {
        //Access already generated and/or provided data
        DataSetAccessor dataSetAccessor = context.getDataSet().accessor();
        TestDataA a = dataSetAccessor.get("A", TestDataA.class); //If name of data was "A"
        TestDataB b = dataSetAccessor.get(TestDataB.class); //If data was derived from DataAdapter<TestDataB>
        //Do something and generate new data
        return new TestDataC(a.getValue() + " " + b.getValue());
    }
}
```
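As described in the multi-request section below, a builder may return null when its inputs are not yet rich enough to produce data. A minimal sketch of that pattern, assuming the accessor's get() returns null for data that has not been provided or generated yet:

```java
@Override
public Data process(DataBuilderContext context) {
    DataSetAccessor dataSetAccessor = context.getDataSet().accessor();
    TestDataB b = dataSetAccessor.get(TestDataB.class);
    if (b == null) {
        //Not enough input yet: returning null tells the executor this
        //builder could not produce data in this iteration
        return null;
    }
    return new TestDataC(b.getValue());
}
```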
- DataBuilderInfo - Used to annotate a DataBuilder with meta-data.
  - name - Logical name of the builder
  - produces - Name of the data it produces
  - consumes - Set of names of data this builder consumes

```java
@DataBuilderInfo(name = "BuilderA", consumes = {"B", "A"}, produces = "C")
public class TestBuilderA extends DataBuilder {
    @Override
    public Data process(DataBuilderContext context) {
        //Produce data
    }
}
```
- DataBuilderClassInfo - Used to annotate a DataBuilder with meta-data. Uses types instead of string names.
  - name - Logical name of the builder (Optional)
  - produces - Class type of the data it produces
  - consumes - Set of class types of data this builder consumes

```java
@DataBuilderClassInfo(produces = TestDataC.class, consumes = {TestDataA.class, TestDataB.class})
public class TestBuilderA extends DataBuilder {
    @Override
    public Data process(DataBuilderContext context) {
        //Produce data
    }
}
```
The examples below assume that you are using annotations to provide meta about the builders. In all cases, meta can instead be provided in the registration functions themselves; annotations are not a prerequisite, just a shortcut.
There are two types of executors:
- SimpleDataFlowExecutor - Uses a single thread to execute and has very low overhead. Use this if your builders are extremely fast and parallel executions will only increase overhead. For example validators and checkers that work only on local data.
- MultiThreadedDataFlowExecutor - Parallelizes execution of builders at the same level of the ExecutionGraph. Use this when you are making remote service calls or your builders are slow for some reason.
Both executors are thread-safe and can be shared between threads; there is no need to create executors multiple times. Executors also provide hooks for handlers that will be invoked during flow executions. See the JavaDocs for details.
```java
final DataFlowExecutor executor
        = new MultiThreadedDataFlowExecutor(Executors.newFixedThreadPool(10));
```
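The single-threaded variant is constructed analogously; a sketch, assuming a no-argument constructor (check the JavaDocs for the exact signatures):

```java
//Single-threaded executor for fast, local-only builders
final DataFlowExecutor localExecutor = new SimpleDataFlowExecutor();
```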
Data flows are created using the DataFlowBuilder. Creating a data flow is a relatively expensive operation. A DataFlow object is thread-safe and its internals are immutable; as such, it should be reused and shared across requests and threads.
Creating a data flow where you know the participating builders is dead simple.
```java
final DataFlow imageUploadFlow = new DataFlowBuilder()
        .withDataBuilder(new ImageStoreSaver())
        .withDataBuilder(new ColorExtractor())
        .withDataBuilder(new CurveExtractor())
        .withDataBuilder(new ExifExtractor())
        .withDataBuilder(new PatternExtractor())
        .withDataBuilder(new ImageValidator())
        .withDataBuilder(new ImageIndexer())
        .withTargetData(ImageSavedResponse.class)
        .build();
```
The above will stitch the flow based on DataBuilderClassInfo annotations on the different builders.
Frequently your application would have a bunch of builders that expose different computations in the system. And you would not want to hardcode flows but be able to build them on the fly using APIs to generate newer workflows without deploying new code.
In this scenario you'd want to use something like the Reflections library to discover such computations and register them in a DataBuilderMetadataManager. Pass this DataBuilderMetadataManager to the DataFlowBuilder along with a target data; the system builds a data flow on the fly, selecting the relevant builders and connecting them properly to generate the final data. In the case where multiple known builders generate the final data, a resolution spec can be specified while building the flow. You'd typically want to cache/store these computed data flows and use them to execute requests. A Reflections-based discovery sketch follows the registration example below.
Err, well, call the register function.
```java
DataBuilderMetadataManager dataBuilderMetadataManager = new DataBuilderMetadataManager();

//Typically the register function would be called in a loop over
//annotated DataBuilder implementations found in the classpath
dataBuilderMetadataManager
    .register(ImageStoreSaver.class)
    .register(ColorExtractor.class)
    .register(CurveExtractor.class)
    .register(ExifExtractor.class)
    .register(PatternExtractor.class)
    .register(ImageValidator.class)
    .register(ImageIndexer.class)
    .register(ImageResponseGenerator.class);
```
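As the comment in the snippet above suggests, registration is typically driven by classpath scanning instead of a hand-maintained list. A sketch using the org.reflections library; the scanned package com.mycompany.builders is a hypothetical placeholder:

```java
import org.reflections.Reflections;

//Discover all builders annotated with @DataBuilderClassInfo and register them
Reflections reflections = new Reflections("com.mycompany.builders");
for (Class<?> builderClass : reflections.getTypesAnnotatedWith(DataBuilderClassInfo.class)) {
    dataBuilderMetadataManager.register(builderClass.asSubclass(DataBuilder.class));
}
```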
Build data flows by specifying the targets:
```java
final DataFlow imageUploadFlow = new DataFlowBuilder()
        .withMetaDataManager(dataBuilderMetadataManager)
        .withName("ImageUpload")
        .withTargetData(ImageSavedResponse.class)
        .build();

final DataFlow imageGetFlow = new DataFlowBuilder()
        .withMetaDataManager(dataBuilderMetadataManager)
        .withName("ImageGet")
        .withTargetData(ImageDetailsResponse.class)
        .build();
```
Executions can happen within a single request or, in checkout-type scenarios, span multiple requests.
Get the input data, execute, and return the values:
```java
public ImageSavedResponse save(final Image image) throws Exception {
    DataFlowExecutionResponse response = executor.run(imageUploadFlow, image);
    return response.get(ImageSavedResponse.class);
}
```
In this case, execution is more complex and goes like this:
- Create a DataFlowInstance for a user activity context. This represents, for example, a checkout session.
- For transactionality across flow changes, every DataFlowInstance should contain its own copy of the DataFlow
- On every invocation of the executor, the DataFlowInstance needs to be passed in along with a set of input data
- The new input data (also called the delta) is added to the DataSet considered for this execution
- The system starts executing and proceeds through the flow as far as it can
- At every step of execution, the freshly produced data is added to the active DataSet
- A builder may choose to return null if its inputs are not rich enough for it to generate data
- If the terminal state is not reached in an iteration, the system loops again from the start, with the DataSet now augmented with the data generated in the previous iteration
- The system stops when:
  - Target Data for the flow has been generated, or
  - No new data has been generated in an iteration
- The augmented DataSet is saved in the instance. Transient data is not added
```java
DataFlowInstance instance = new DataFlowInstance("test-123", checkoutFlow.deepCopy());
```
Example: Multi-step checkout

```java
//Step 1: Login and address
response = executor.run(instance, userDetails);
//Step 2: Cart details and edits
response = executor.run(instance, cart);
//Step 3: Payments
response = executor.run(instance, payment);
//Done
```

Example: Single page checkout

```java
response = executor.run(instance, userDetails, cart, payment);
//Done
```
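Since each step typically arrives on a separate request, the caller needs to detect whether the flow has completed. A sketch, assuming (as in the single-request example above) that the response's get() returns the generated data, or null if it was not produced; OrderResponse, completeCheckout, and askUserForMoreDetails are hypothetical application names:

```java
//Run one step with whatever data arrived in this request
DataFlowExecutionResponse response = executor.run(instance, incomingData);

OrderResponse order = response.get(OrderResponse.class);
if (order != null) {
    completeCheckout(order);       //target generated: flow is complete
} else {
    askUserForMoreDetails();       //flow paused: wait for the next delta
}
```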
That's it. The library is extensively tested and documented. Go through the JavaDocs; they contain a lot of detail.
For bugs, questions, and discussions please use GitHub Issues.
If you would like to contribute code you can do so through GitHub by forking the repository and sending a pull request.
When submitting code, please make every effort to follow existing conventions and style in order to keep the code as readable as possible.
By contributing your code, you agree to license your contribution under the terms of the APLv2: http://www.apache.org/licenses/LICENSE-2.0
All files are released with the Apache 2.0 license.
If you are adding a new file it should have a header like this:

```java
/**
 * Copyright 2015 Flipkart Internet Pvt. Ltd.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
```
Copyright 2015 Flipkart Internet Pvt. Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.