Idea: per-partition joins #716

Open
erizocosmico opened this issue Feb 28, 2019 · 2 comments
Labels
enhancement New feature or request performance Performance improvements

Comments

@erizocosmico
Contributor

erizocosmico commented Feb 28, 2019

Right now, since one partition corresponds to one repository, we know that joins (by repository) can only match rows within the same partition.
Despite that, we iterate over all the partitions and try to join them all together.

Imagine we have 3 partitions.

These are the rows returned by each partition in the left side of a join.

  • P1: 45
  • P2: 5
  • P3: 50

These are the rows returned by each partition in the right side of a join.

  • P1: 55
  • P2: 15
  • P3: 30

We are joining 100 rows with 100 rows, which produces 10,000 rows that are then filtered by the join conditions (but we still perform those 10k iterations).

Instead, if we did this per partition, these would be the produced rows (then filtered by conditions):

  • P1: 2475
  • P2: 75
  • P3: 1500

The total number of rows produced is 4050, which is about 40% of the rows generated before. The gap grows enormously as the number of partitions and rows grows.
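
To make the arithmetic concrete, here is a tiny standalone Go snippet (illustration only, not project code) that computes both totals for the example above:

package main

import "fmt"

func main() {
    left := []int{45, 5, 50}   // rows per partition, left side
    right := []int{55, 15, 30} // rows per partition, right side

    totalLeft, totalRight, perPartition := 0, 0, 0
    for i := range left {
        totalLeft += left[i]
        totalRight += right[i]
        perPartition += left[i] * right[i]
    }

    fmt.Println("global join:", totalLeft*totalRight) // 10000 rows
    fmt.Println("per-partition:", perPartition)       // 4050 rows
}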

What could we do?

A rule that runs at the end of the analysis and transforms joins (the ones left after the squash) into something like:

Concat
 |- InnerJoin
    |- PartitionTable(TableA)
    |- PartitionTable(TableB)

PartitionTable is a table that will only return the rows for one partition.
Concat is a node that will iterate over all partitions and, for each one, transform all the Table children underneath it into PartitionTables. The rows of each partition will then be put together and returned to the user. This will also happen in parallel.
Essentially, Concat is like an Exchange; the only difference is that it can handle binary nodes and not just unary ones. This is something that cannot be done in go-mysql-server but can be done here, since we know for certain that the partitions are the same for each table.
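
To illustrate the idea, here is a rough Go sketch. All types here (Partition, Row, Node, Concat) are simplified stand-ins, not actual gitbase or go-mysql-server interfaces, and both the parallelism and the Table-to-PartitionTable rewrite are omitted for brevity:

// Stand-ins for the real engine types.
type Partition interface{ Key() []byte }

type Row []interface{}

// Node is a stand-in for the engine's plan-node interface.
type Node interface {
    // Rows evaluates the subtree, restricted to a single partition.
    Rows(p Partition) ([]Row, error)
}

// Concat runs the join subtree once per partition and concatenates
// the results. A real implementation would rewrite every Table leaf
// under the join into a PartitionTable bound to the current partition,
// and would process partitions in parallel.
type Concat struct {
    join       Node
    partitions []Partition
}

func (c *Concat) AllRows() ([]Row, error) {
    var out []Row
    for _, p := range c.partitions {
        rows, err := c.join.Rows(p)
        if err != nil {
            return nil, err
        }
        out = append(out, rows...)
    }
    return out, nil
}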

I called it Concat, but the name is pretty lame, so we should think of a better one, such as PartitionExchange, BinaryExchange, or something like that.

This should make non-squashed joins much, much faster, and in real-life applications you will have many of them, because the leaves will be subqueries.

@erizocosmico erizocosmico added enhancement New feature or request performance Performance improvements labels Feb 28, 2019
@ajnavarro
Contributor

ajnavarro commented Mar 5, 2019

If I understood correctly, this will only work this way if the join conditions are relations that can only hold inside a single repository, right?

In that case, what is the difference between this parallelized join and squashed tables?

@erizocosmico
Contributor Author

@ajnavarro yup

This is tailored for the joins left after squashing. There are cases in which you will have joins for sure; for example, when you are joining the results of two subqueries, which is a pretty common thing to have in any real-world application.

Consider the following query:

SELECT uast_extract(
    uast(blob_content, 'PHP', "//uast:String"),
    '@pos') AS positions,
    repository_id,
    file_path
FROM (
    SELECT f.repository_id,
        f.file_path,
        b.blob_content
    FROM (
        SELECT *
        FROM refs r
        NATURAL JOIN commit_blobs cb
        NATURAL JOIN blobs
        WHERE r.ref_name = 'HEAD'
            AND NOT IS_BINARY(blob_content)
    ) b
    INNER JOIN (
        SELECT repository_id, file_path, blob_hash
        FROM refs r
        NATURAL JOIN commit_files cf
        WHERE r.ref_name = 'HEAD'
    ) f
    ON b.blob_hash = f.blob_hash
        AND b.repository_id = f.repository_id
    WHERE language(f.file_path, b.blob_content) = 'PHP'
) t
WHERE positions IS NOT NULL

This cannot be squashed. Because of that, you're performing a massive cross join over a lot of data if you run it on a big dataset. If you do the joins per partition, you generate far less data and thus make the query much faster.
It also has another advantage: since we know only rows from the same partition will match, in-memory joins become more feasible. You don't need to keep all the rows of one side in memory, only the rows of one partition; then you move on to the next partition, and so on.
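
As an illustration, a per-partition in-memory hash join could look roughly like this, reusing the hypothetical Partition and Row stand-ins from the sketch in the issue description (again, not actual gitbase code):

// Sketch of a per-partition in-memory hash join. Only one partition's
// worth of right-side rows is ever held in memory at a time.
func joinPerPartition(
    partitions []Partition,
    leftRows, rightRows func(Partition) ([]Row, error),
    key func(Row) string, // extracts the join key from a row
) ([]Row, error) {
    var out []Row
    for _, p := range partitions {
        right, err := rightRows(p)
        if err != nil {
            return nil, err
        }
        // Build the hash table for this partition only.
        table := make(map[string][]Row)
        for _, r := range right {
            k := key(r)
            table[k] = append(table[k], r)
        }
        left, err := leftRows(p)
        if err != nil {
            return nil, err
        }
        for _, l := range left {
            for _, r := range table[key(l)] {
                joined := append(append(Row{}, l...), r...)
                out = append(out, joined)
            }
        }
    }
    return out, nil
}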
