Create partitions with more similar size #619

Open · ajnavarro opened this issue Nov 16, 2018 · 10 comments
Labels: performance (Performance improvements) · proposal (proposal for new additions or changes) · research (Something that requires research)

Comments

@ajnavarro (Contributor) commented Nov 16, 2018

Partitions with similar size

Description

Right now, our partition implementation uses one partition per repository. That means partitions can have very different sizes, which causes problems such as spending a very long time processing 5% of the total number of repositories (a single thread parsing one huge repository).

With this proposal, I want to normalize partition sizes by changing the way we split the data across several threads.

Proposal

Change the current partition ID:

[REPOSITORY_ID]

To:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Values description

  • [REPOSITORY_ID]: gitbase ID identifying the repository. Examples: gitbase, kubernetes, spark.
  • [PACKFILE_HASH]: packfile hash. Example: ccbe18d4ba50a14a28855a7b53880d93d87dc498
  • [OFFSET_FROM]: packfile offset of the first object in the partition. Example: 23144
  • [OFFSET_TO]: packfile offset of the last object in the partition. Example: 7009211
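
To make the encoding concrete, here is a minimal sketch in Go of how such a composite key could be built and parsed. The PartitionKey type, the separator handling, and the function names are illustrative assumptions, not gitbase's actual code:

```go
package partition

import (
	"fmt"
	"strconv"
	"strings"
)

// PartitionKey is a hypothetical representation of the proposed
// composite partition ID.
type PartitionKey struct {
	RepositoryID string
	PackfileHash string
	OffsetFrom   uint64
	OffsetTo     uint64
}

// String encodes the key as [REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO].
func (k PartitionKey) String() string {
	return strings.Join([]string{
		k.RepositoryID,
		k.PackfileHash,
		strconv.FormatUint(k.OffsetFrom, 10),
		strconv.FormatUint(k.OffsetTo, 10),
	}, "+")
}

// ParsePartitionKey decodes a key produced by String. It assumes the
// repository ID itself contains no "+".
func ParsePartitionKey(s string) (PartitionKey, error) {
	parts := strings.Split(s, "+")
	if len(parts) != 4 {
		return PartitionKey{}, fmt.Errorf("malformed partition key: %q", s)
	}
	from, err := strconv.ParseUint(parts[2], 10, 64)
	if err != nil {
		return PartitionKey{}, err
	}
	to, err := strconv.ParseUint(parts[3], 10, 64)
	if err != nil {
		return PartitionKey{}, err
	}
	return PartitionKey{parts[0], parts[1], from, to}, nil
}
```

With the example values above, the key would read gitbase+ccbe18d4ba50a14a28855a7b53880d93d87dc498+23144+7009211.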

How to repartition

We need to decide how many objects each partition will hold in order to generate the offsets. That object count will be a magic number and won't change: if it were configurable, indexes and other internal modules would stop working correctly.

Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will have the same number of objects, except the last one. This operation should be fast enough to perform per query, and we can cache the partition names if necessary. A rough sketch of the computation follows.
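
As a rough illustration, assuming the sorted object offsets have already been read from the packfile's .idx file (and reusing the hypothetical PartitionKey from the sketch above):

```go
// makePartitions splits a packfile's objects into fixed-size groups.
// offsets must be the object offsets from the packfile index, sorted
// ascending; objectsPerPartition is the fixed "magic number" from the
// proposal (it must be > 0). Every partition holds exactly
// objectsPerPartition objects except possibly the last one.
func makePartitions(repoID, packHash string, offsets []uint64, objectsPerPartition int) []PartitionKey {
	var keys []PartitionKey
	for i := 0; i < len(offsets); i += objectsPerPartition {
		end := i + objectsPerPartition
		if end > len(offsets) {
			end = len(offsets)
		}
		keys = append(keys, PartitionKey{
			RepositoryID: repoID,
			PackfileHash: packHash,
			OffsetFrom:   offsets[i],
			OffsetTo:     offsets[end-1],
		})
	}
	return keys
}
```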

This approach also remains valid when repositories are updated: partitions are tied to a packfile hash, so a packfile's content cannot change; packfiles can only be deleted, or new ones created.

How to handle these partitions in each main table

Object tables (Commits, Blobs, Trees, Tree entries)

Each partition will only iterate the objects in its own range, so no objects will be duplicated in the output.

References

Only references that point to an object inside the partition's range will be output for that partition. To do this we can use the packfile indexes, which should be fast; see the sketch below.
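
A sketch of that filter, assuming an offset lookup roughly like the one go-git's idxfile package provides (the offsetIndex interface and refInPartition are hypothetical names):

```go
// offsetIndex abstracts the packfile index lookup. go-git's idxfile
// package exposes a similar FindOffset method; the exact signature
// here is an assumption for illustration.
type offsetIndex interface {
	FindOffset(hash string) (offset uint64, ok bool)
}

// refInPartition reports whether a reference belongs to a partition:
// it does only if the object it points to lives inside the
// partition's offset range.
func refInPartition(idx offsetIndex, targetHash string, p PartitionKey) bool {
	off, ok := idx.FindOffset(targetHash)
	if !ok {
		return false // the target object is not in this packfile
	}
	return off >= p.OffsetFrom && off <= p.OffsetTo
}
```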

Files

Files will appear only in the partition where the root tree entry is present. It might be necessary to read tree_entries from other packfile ranges to generate the path using the repository.

Caveats

Problems

  • We need to check how go-git behaves when it is used from several threads.

Benefits

  • Partitions will be more homogeneous and will take roughly the same time to process, giving the user a good approximation of the query's size and the remaining time.
@ajnavarro added the proposal, performance, and research labels on Nov 16, 2018
@smola (Contributor) commented Nov 28, 2018

> Files will appear only in the partition where the blob is present. It might be necessary to get tree_entries from other packfile ranges to generate the path using the repository.

I think it will be easier to implement and more consistent with the other tables if we partition according to the root tree_entry, not the blob hash.

> Only references that point to an object inside the partition's range will be output for that partition. To do this we can use the packfile indexes, which should be fast.

Can't we just partition them like any other table?

@ajnavarro (Contributor, Author) commented Nov 28, 2018

@smola I changed the files table to be partitioned by the root tree entry in the main issue description.

How can we make references work like any other table? They are not inside the packfile, so they don't appear in the index, and we cannot apply the following partitioning to them:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Because references are just that, references to an object inside the packfile, we can simply output in each partition the references that point to objects inside that partition.

@smola (Contributor) commented Nov 29, 2018

> How can we make references work like any other table?

I guess they would need a different partitioning scheme. For example:
REPOSITORY_ID+PARTITION_NUM, then selecting references with something like fast_hash(reference_name) % PARTITION_NUM, as in the sketch below.
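
A sketch of that scheme, using FNV-1a as the fast hash (the concrete hash function is an illustrative assumption):

```go
import "hash/fnv"

// refPartition assigns a reference to one of numPartitions buckets by
// hashing its name: fast_hash(reference_name) % PARTITION_NUM.
// numPartitions must be > 0.
func refPartition(referenceName string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(referenceName)) // fnv's Write never returns an error
	return h.Sum32() % numPartitions
}
```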

The problem I see with partitioning them based on what they point to is that they can reference tag objects or commit objects, so the logic is not that trivial.

@ajnavarro (Contributor, Author) commented Nov 29, 2018

@smola If they are objects, they will be in the index; it doesn't matter whether it's a tag object or a commit object.

Maybe I'm not completely understanding what you mean.

@smola (Contributor) commented Nov 29, 2018

Actually, I was thinking about ways to determine partitions upfront. But I see you're proposing to inspect the repositories to find the partitions:

> Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will have the same number of objects, except the last one. This operation should be fast enough to perform per query, and we can cache the partition names if necessary.

So, yeah, your approach of partitioning references by the offset of the object they point to should work.

@kuba-- (Contributor) commented Jan 30, 2019

Static partition sizes have some benefits. For instance, you don't have to calculate a mapping of where everything is. On the other hand, partitions still won't be perfectly balanced. We could try to distribute objects across multiple partitions while tracking their sizes (a kind of round-robin, sketched below), but it would require an extra mapping of where each object is, or which objects each partition holds.

Last but not least, I'm not sure we need perfectly balanced partitions. For small sizes, maybe it's better to have one bigger partition instead of many tiny ones (because of the extra iteration plus context switching if we want to parallelise queries).
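
For illustration, a minimal sketch of that size-tracking distribution: greedily place each object in the currently smallest partition. The object type and the function are hypothetical:

```go
// object is a hypothetical pair of an object hash and its size.
type object struct {
	Hash string
	Size uint64
}

// balancePartitions greedily assigns each object to the currently
// smallest of n partitions (n must be > 0), tracking accumulated
// sizes as it goes. The returned map is exactly the extra bookkeeping
// mentioned above: which partition holds which object.
func balancePartitions(objects []object, n int) map[string]int {
	sizes := make([]uint64, n)
	placement := make(map[string]int, len(objects))
	for _, o := range objects {
		smallest := 0
		for i := 1; i < n; i++ {
			if sizes[i] < sizes[smallest] {
				smallest = i
			}
		}
		sizes[smallest] += o.Size
		placement[o.Hash] = smallest
	}
	return placement
}
```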

@ajnavarro (Contributor, Author) commented

With issue #686 in mind, this proposal makes less sense.

We should think of other ways to parallelize queries. Instead of just parallelizing per repository, we could parallelize parts of the query tree after reading the data from the tables.

On the other hand, if we still want to create more than one partition per repository, another idea is to think about how to split the DAG into several parallel parts.

@ajnavarro (Contributor, Author) commented

We based our parallelization logic on some of the ideas in this paper:

There you can see other ways of parallelizing data processing.

@ajnavarro added this to the OKR-2019-Q1-P2 milestone on Feb 8, 2019
@erizocosmico (Contributor) commented

Do we still want to do this @ajnavarro?

@ajnavarro (Contributor, Author) commented

It would solve a lot of the problems we currently have (unknown remaining query processing time, repositories taking a long time to process, and so on). I thought about partitioning at the repository level plus splitting the DAG into more or less similar parts, but I didn't have time for it.
