TODO:
- Think about fuzzy matching for reduce / entity resolution
- Support equijoins
- Inputs should be accessed as input['title'] instead of just title; all prompt templating should go through Jinja
- Make flatten a separate operator with flatten_key (or nothing)
- Convert parallel flatmap to parallel map
- Write documentation & restructure codebase
- Write tests
- Chunking/splitting with peripheral chunks
- Write build phase
- Add keys / inputs to reduce
- For reduce we should pass through keys
- Optimize maps
- Track costs for the optimizer
- Generate multiple plans and evaluate them instead of generating one plan
- Don't use an LLM to determine the right chunk size; try searching many chunk sizes
- Call llm agent multiple times on different random inputs & average results
- Decompose map to be a chain or parallel map
- Debug finicky combine prompts
- Optimize resolvers (add blocking rules)
- Optimize reduce
- Implement fold pattern
- Optimize folds
- Stratified sample the reduce operations based on the groupby results
- Synthesize multiple fold prompts
- Implement merge pattern
- Optimize merges
- Synthesize merge prompts
- Derive num_parallel_folds in the reduce op itself (saving the runtimes of folds and merges)
- Try various batch sizes
- Optimize equijoins
- Support multiple operator workflows in the optimizer
- Calculate explosion factor
- Incorporate selectivity estimates in the multi-operator optimization
- Write gleaning optimization step
- Incorporate gleaning in reduce
- Support a summary-type reduce, where we sample k elements to put in the prompt and do a batch reduce
- Write a non-associative reduce
- Write documentation on how all the operators work
- Auto-generate resolver
- Support summarizing peripheral chunks
- Change validation to be pairwise comparisons (for map, at least) (Aug 14 & 15)
- Only compare the plans that are highest scoring
- Support unnesting
- Reduce operator: support reduce keys as list
- Refactor map optimizer
- In map optimizer, when creating a split, add a uuid to each record being split (instead of relying on some doc id)
- Recursively optimize operations (e.g., reduces in maps) (Aug 16 & 17 & 19)
- In map optimizer: if the submap output is a list, then we should add an unnest operation
- In reduce optimizer: query agent if we should drill-down / do a subreduce
- In map optimizer: prune the chunk size plans that don't give individually good results for the chunks
- In map optimizer: optimize the reduce operator for each chunk size plan
- In reduce optimizer: synthesize resolver if need be
- In resolve optimizer, support list-type reduce keys
- Operator reordering
- Support equivalence: map -> unnest -> reduce might be the same as split -> gather -> map -> unnest -> reduce (no need for a reduce right after the map)
- Run tests in CI
- Support retry on validation failure
- Break down split into split + gather (Aug 21 & 22)
  - Support this in runner too
- Support more flexible chunking strategies
  - Delimiter-based splitting
  - Encode this in the API somehow
  - Support this kind of chunking in the optimizer
- Extract headers & levels from documents, and add the level hierarchy to the chunk
- Support tool use in map operators
- Support prompts exceeding context windows; figure out how to throw out data / prioritize elements
- Support retries in the optimizers
- Operations should not be defined as dictionaries; they should be objects
- Support unnests in the optimizer
- Print out the name of the plan we are synthesizing
- Add gleaning plan to reduce
- Reduce optimizer should get a human to confirm whether a drill-down/roll-up decomposition makes sense
- Allow gleaning model to be different from the main op model
- HITL for prompt selection (generally, a textual app)
- Fix bug in recursively optimizing reduce in the map optimizer
- Support reduce key of "all"
- Change reduce_key to group_by
- Write tests for optimizers
- Refactor reduce and join optimizers
- Support prompt prefix/custom instructions so users don't have to put them in every operation
- Filter optimizer
- Extend map optimizer to support filter
- Train an embedding classifier for filter
- Support passing expectations
- Write intermediates to disk
- Support order by
- Reduce operations: eagerly process merges to prevent against stragglers/tail latencies in folds?
- Rewrite API for equijoin input data
- Allow for few-shot examples for each operation, to use for the optimizer (especially joins)
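
A few of the items above concern the same templating change (accessing fields as input['title'] and moving everything to Jinja). A minimal sketch of the intended access pattern, with a hypothetical render_prompt helper standing in for a real Jinja2 render:

```python
import re

def render_prompt(template: str, record: dict) -> str:
    # Stand-in for a Jinja2 render: resolve {{ input['key'] }} tokens
    # against the record dict, so prompts reference fields explicitly.
    def sub(m: re.Match) -> str:
        return str(record[m.group(1)])
    return re.sub(r"\{\{\s*input\['([^']+)'\]\s*\}\}", sub, template)

prompt = render_prompt("Summarize: {{ input['title'] }}", {"title": "Q3 report"})
# → "Summarize: Q3 report"
```

A real implementation would hand the template straight to Jinja2 with `input=record` in the render context; this sketch only illustrates the access convention.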
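
The chunking items (peripheral chunks; breaking split into split + gather) can be sketched as two plain functions. The names split_into_chunks and gather_peripheral are hypothetical, and the fixed-size splitter is a stand-in for delimiter-aware strategies:

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Naive fixed-size split; a real splitter might respect delimiters
    # or document structure (headers, levels).
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def gather_peripheral(chunks: list[str], idx: int,
                      n_before: int = 1, n_after: int = 1) -> str:
    # Gather: surround the main chunk with neighboring (peripheral)
    # chunks so the downstream map prompt sees local context.
    before = chunks[max(0, idx - n_before):idx]
    after = chunks[idx + 1:idx + 1 + n_after]
    return "".join(before) + chunks[idx] + "".join(after)

chunks = split_into_chunks("abcdefghij", 3)   # ['abc', 'def', 'ghi', 'j']
context = gather_peripheral(chunks, 1)        # 'abcdefghi'
```

Separating split from gather also makes "summarizing peripheral chunks" a drop-in variant: gather would insert summaries instead of raw neighbor text.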
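
The fold and merge items describe two halves of an incremental reduce: a fold updates a running accumulator one batch at a time, and a merge combines independently folded partial results, which is what makes parallel folds (and deriving num_parallel_folds) possible. A hedged sketch with plain-Python aggregation standing in for the LLM calls:

```python
def fold(accumulator: dict, batch: list[dict]) -> dict:
    # Fold: incorporate one batch of inputs into a running summary.
    # (In the real operator this would be an LLM call with a fold prompt.)
    counts = dict(accumulator)
    for item in batch:
        counts[item["symptom"]] = counts.get(item["symptom"], 0) + 1
    return counts

def merge(left: dict, right: dict) -> dict:
    # Merge: combine two partial fold results; folds over disjoint
    # batches can then run in parallel and be merged afterward.
    out = dict(left)
    for key, value in right.items():
        out[key] = out.get(key, 0) + value
    return out

part1 = fold({}, [{"symptom": "cough"}, {"symptom": "fever"}])
part2 = fold({}, [{"symptom": "cough"}])
total = merge(part1, part2)   # {'cough': 2, 'fever': 1}
```

The eager-merge question above (processing merges as folds finish, to avoid straggler folds blocking the whole reduce) fits this same shape.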
Things to think about:
- Filter chunks before applying the map prompt
- Reduce does not need to be an LLM call:
  - it can just be a concatenation of the inputs to the potential LLM call
  - it could also be some normal aggregation (e.g., summing up the counts of symptoms, doing a conjunction or disjunction of intermediate outputs for a filter operation)
- If the user specifies a map call in two different ways, they should get the same result. E.g., say they want a list of all the symptoms referenced in a medical transcript along with what caused each symptom.
- Resolve should support resolution within groups, not necessarily a global resolve
- Synthesize empty resolve in either builder or reduce optimizer, not both
- Figure out how to run validators when data is too large to fit in the prompt (need to randomly sample part of the document)
- In reduce optimizer: if agent suggests drill-down, see if we need to add a map to create the subreduce keys, or the subreduce key already exists
- Try various combine prompts in the reduce optimizer
- Filter optimizer: we should recursively optimize reduces if the reduce isn't good on its own
- Support retry on validation failure for operations beyond map/filter
- If reduce input is too big to fit in the prompt, prompt for a map operation
- Pipeline optimization: group maps and reduces together after one pass of optimization
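
The non-LLM reduce idea above (plain concatenation, or an ordinary aggregation such as a disjunction for filter outputs) can be made concrete; the function names are illustrative:

```python
def concat_reduce(docs: list[str]) -> str:
    # Reduce as pure concatenation: the joined text could later feed a
    # single LLM call, but the reduce step itself is mechanical.
    return "\n".join(docs)

def disjunction_reduce(flags: list[bool]) -> bool:
    # Reduce for a filter: keep the group if any intermediate output passed.
    return any(flags)

joined = concat_reduce(["note A", "note B"])
keep = disjunction_reduce([False, True, False])   # True
```

Recognizing these cases matters for the optimizer: a mechanical reduce is free and exact, so no combine prompt needs to be synthesized or validated for it.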