Create partitions with more similar size #619

Open · ajnavarro opened this issue Nov 16, 2018 · 10 comments
Labels: performance (Performance improvements) · proposal (proposal for new additions or changes) · research (Something that requires research)

Comments

@ajnavarro (Contributor) commented Nov 16, 2018

Partitions with similar size

Description

Right now, our partition implementation uses one partition per repository. That means partitions can have very different sizes, which causes problems such as spending a very long time processing 5% of the total number of repositories (a single thread parsing one huge repository).

With this proposal, I want to normalize partition sizes by changing the way we split the data across several threads.

Proposal

Change the current partition ID:

[REPOSITORY_ID]

To:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Values description

  • [REPOSITORY_ID]: gitbase ID identifying the repository. Examples: gitbase, kubernetes, spark.
  • [PACKFILE_HASH]: packfile hash. Example: ccbe18d4ba50a14a28855a7b53880d93d87dc498
  • [OFFSET_FROM]: packfile offset of the first object in the partition. Example: 23144
  • [OFFSET_TO]: packfile offset of the last object in the partition. Example: 7009211
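
To make the encoding concrete, here is a minimal sketch in Go of how such a composite key could be built and parsed. The PartitionKey type, the separator handling, and the function names are illustrative assumptions, not gitbase's actual code:

```go
package partition

import (
	"fmt"
	"strconv"
	"strings"
)

// PartitionKey is a hypothetical representation of the proposed
// composite partition ID.
type PartitionKey struct {
	RepositoryID string
	PackfileHash string
	OffsetFrom   uint64
	OffsetTo     uint64
}

// String encodes the key as [REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO].
func (k PartitionKey) String() string {
	return strings.Join([]string{
		k.RepositoryID,
		k.PackfileHash,
		strconv.FormatUint(k.OffsetFrom, 10),
		strconv.FormatUint(k.OffsetTo, 10),
	}, "+")
}

// ParsePartitionKey decodes a key produced by String. It assumes the
// repository ID itself contains no "+".
func ParsePartitionKey(s string) (PartitionKey, error) {
	parts := strings.Split(s, "+")
	if len(parts) != 4 {
		return PartitionKey{}, fmt.Errorf("malformed partition key: %q", s)
	}
	from, err := strconv.ParseUint(parts[2], 10, 64)
	if err != nil {
		return PartitionKey{}, err
	}
	to, err := strconv.ParseUint(parts[3], 10, 64)
	if err != nil {
		return PartitionKey{}, err
	}
	return PartitionKey{parts[0], parts[1], from, to}, nil
}
```

With the example values above, the key would read gitbase+ccbe18d4ba50a14a28855a7b53880d93d87dc498+23144+7009211.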

How to repartition

We need to decide how many objects each partition will hold in order to generate the offsets. That object count will be a magic number and won't change: if it were configurable, indexes and other internal modules would stop working correctly.

Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will have the same number of objects, except the last one. This operation should be fast enough to perform per query, and we can cache the partition names if necessary. A rough sketch of the computation follows.
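
As a rough illustration, assuming the sorted object offsets have already been read from the packfile's .idx file (and reusing the hypothetical PartitionKey from the sketch above):

```go
// makePartitions splits a packfile's objects into fixed-size groups.
// offsets must be the object offsets from the packfile index, sorted
// ascending; objectsPerPartition is the fixed "magic number" from the
// proposal (it must be > 0). Every partition holds exactly
// objectsPerPartition objects except possibly the last one.
func makePartitions(repoID, packHash string, offsets []uint64, objectsPerPartition int) []PartitionKey {
	var keys []PartitionKey
	for i := 0; i < len(offsets); i += objectsPerPartition {
		end := i + objectsPerPartition
		if end > len(offsets) {
			end = len(offsets)
		}
		keys = append(keys, PartitionKey{
			RepositoryID: repoID,
			PackfileHash: packHash,
			OffsetFrom:   offsets[i],
			OffsetTo:     offsets[end-1],
		})
	}
	return keys
}
```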

This approach also remains valid when repositories are updated: partitions are tied to a packfile hash, so a packfile's content cannot change; packfiles can only be deleted, or new ones created.

How to handle these partitions in each main table

Object tables (Commits, Blobs, Trees, Tree entries)

Each partition will only iterate the objects in its own range, so no objects will be duplicated in the output.

References

Only references that point to an object inside the partition's range will be output for that partition. To do this we can use the packfile indexes, which should be fast; see the sketch below.
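
A sketch of that filter, assuming an offset lookup roughly like the one go-git's idxfile package provides (the offsetIndex interface and refInPartition are hypothetical names):

```go
// offsetIndex abstracts the packfile index lookup. go-git's idxfile
// package exposes a similar FindOffset method; the exact signature
// here is an assumption for illustration.
type offsetIndex interface {
	FindOffset(hash string) (offset uint64, ok bool)
}

// refInPartition reports whether a reference belongs to a partition:
// it does only if the object it points to lives inside the
// partition's offset range.
func refInPartition(idx offsetIndex, targetHash string, p PartitionKey) bool {
	off, ok := idx.FindOffset(targetHash)
	if !ok {
		return false // the target object is not in this packfile
	}
	return off >= p.OffsetFrom && off <= p.OffsetTo
}
```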

Files

Files will appear only in the partition where the root tree entry is present. It might be necessary to read tree_entries from other packfile ranges to generate the path using the repository.

Caveats

Problems

  • We need to check how go-git behaves when it is used from several threads.

Benefits

  • Partitions will be more homogeneous and will take roughly the same time to process, giving the user a good approximation of the query's size and the remaining time.
@ajnavarro added the proposal, performance, and research labels on Nov 16, 2018
@smola (Contributor) commented Nov 28, 2018

> Files will appear only in the partition where the blob is present. It might be necessary to get tree_entries from other packfile ranges to generate the path using the repository.

I think it will be easier to implement and more consistent with the other tables if we partition according to the root tree_entry, not the blob hash.

> Only references that point to an object inside the partition's range will be output for that partition. To do this we can use the packfile indexes, which should be fast.

Can't we just partition them like any other table?

@ajnavarro (Contributor, Author) commented Nov 28, 2018

@smola I changed the files table to be partitioned by the root tree entry in the main issue description.

How can we make references work like any other table? They are not inside the packfile, so they don't appear in the index, and we cannot apply the following partitioning to them:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Because references are just that, references to an object inside the packfile, we can simply output in each partition the references that point to objects inside that partition.

@smola (Contributor) commented Nov 29, 2018

> How can we make references work like any other table?

I guess they would need a different partitioning scheme. For example:
REPOSITORY_ID+PARTITION_NUM, then selecting references with something like fast_hash(reference_name) % PARTITION_NUM, as in the sketch below.
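
A sketch of that scheme, using FNV-1a as the fast hash (the concrete hash function is an illustrative assumption):

```go
import "hash/fnv"

// refPartition assigns a reference to one of numPartitions buckets by
// hashing its name: fast_hash(reference_name) % PARTITION_NUM.
// numPartitions must be > 0.
func refPartition(referenceName string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(referenceName)) // fnv's Write never returns an error
	return h.Sum32() % numPartitions
}
```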

The problem I see with partitioning them based on what they point to is that they can reference tag objects or commit objects, so the logic is not that trivial.

@ajnavarro (Contributor, Author) commented Nov 29, 2018

@smola If they are objects, they will be in the index; it doesn't matter whether it's a tag object or a commit object.

Maybe I'm not completely understanding what you mean.

@smola (Contributor) commented Nov 29, 2018

Actually, I was thinking about ways to determine partitions upfront. But I see you're proposing to inspect the repositories to find the partitions:

> Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will have the same number of objects, except the last one. This operation should be fast enough to perform per query, and we can cache the partition names if necessary.

So, yeah, your approach of partitioning references by the offset of the object they point to should work.

@kuba-- (Contributor) commented Jan 30, 2019

Static partition sizes have some benefits. For instance, you don't have to calculate a mapping of where everything is. On the other hand, partitions still won't be perfectly balanced. We could try to distribute objects across multiple partitions while tracking their sizes (a kind of round-robin, sketched below), but it would require an extra mapping of where each object is, or which objects each partition holds.

Last but not least, I'm not sure we need perfectly balanced partitions. For small sizes, maybe it's better to have one bigger partition instead of many tiny ones (because of the extra iteration plus context switching if we want to parallelise queries).
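
For illustration, a minimal sketch of that size-tracking distribution: greedily place each object in the currently smallest partition. The object type and the function are hypothetical:

```go
// object is a hypothetical pair of an object hash and its size.
type object struct {
	Hash string
	Size uint64
}

// balancePartitions greedily assigns each object to the currently
// smallest of n partitions (n must be > 0), tracking accumulated
// sizes as it goes. The returned map is exactly the extra bookkeeping
// mentioned above: which partition holds which object.
func balancePartitions(objects []object, n int) map[string]int {
	sizes := make([]uint64, n)
	placement := make(map[string]int, len(objects))
	for _, o := range objects {
		smallest := 0
		for i := 1; i < n; i++ {
			if sizes[i] < sizes[smallest] {
				smallest = i
			}
		}
		sizes[smallest] += o.Size
		placement[o.Hash] = smallest
	}
	return placement
}
```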

@ajnavarro (Contributor, Author) commented

With issue #686 in mind, this proposal makes less sense.

We should think of other ways to parallelize queries. Instead of just parallelizing per repository, we could parallelize parts of the query tree after reading the data from the tables.

On the other hand, if we still want to create more than one partition per repository, another idea is to think about how to split the DAG into several parallel parts.

@ajnavarro (Contributor, Author) commented

We based our parallelization logic on some of the ideas in this paper:

There you can see other ways of parallelizing data processing.

@ajnavarro added this to the OKR-2019-Q1-P2 milestone on Feb 8, 2019
@erizocosmico (Contributor) commented

Do we still want to do this @ajnavarro?

@ajnavarro (Contributor, Author) commented

It would solve a lot of the problems we currently have (unknown remaining query processing time, repositories taking a long time to process, and so on). I thought about partitioning at the repository level plus splitting the DAG into more or less similar parts, but I didn't have time for it.
