Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chan z prefix sum #2781

Merged
merged 21 commits into from
Nov 6, 2024
Merged

Chan z prefix sum #2781

merged 21 commits into from
Nov 6, 2024

Conversation

amin1377
Copy link
Contributor

@amin1377 amin1377 commented Oct 18, 2024

Description

This PR optimizes the calculation of the ChanZ cost factor by precomputing and storing the cumulative number of inter-die connections prior to block placement. During placement, the number of inter-die connections within a bounding box is calculated using the following formula:

if (x_low == 0 && y_low == 0) {
    num_inter_dir_conn = acc_tile_num_inter_die_conn_[x_high][y_high];
} else if (x_low == 0) {
    num_inter_dir_conn = acc_tile_num_inter_die_conn_[x_high][y_high] - 
                         acc_tile_num_inter_die_conn_[x_high][y_low - 1];
} else if (y_low == 0) {
    num_inter_dir_conn = acc_tile_num_inter_die_conn_[x_high][y_high] - 
                         acc_tile_num_inter_die_conn_[x_low - 1][y_high];
} else {
    num_inter_dir_conn = acc_tile_num_inter_die_conn_[x_high][y_high] - 
                         acc_tile_num_inter_die_conn_[x_low - 1][y_high] - 
                         acc_tile_num_inter_die_conn_[x_high][y_low - 1] + 
                         acc_tile_num_inter_die_conn_[x_low - 1][y_low - 1];
}

@github-actions github-actions bot added VPR VPR FPGA Placement & Routing Tool lang-cpp C/C++ code labels Oct 18, 2024
@vaughnbetz
Copy link
Contributor

  1. You should comment how the prefix sum computation works.
  2. We can't afford an O(n^2) memory and O(n^2) compute to load this data structure, as N is the size of the circuit. We shouldn't compute the final matrix, but instead just store the O(n) (2D in x,y) prefix sum. Compute the average z-capacity in a helper function when called.

@vaughnbetz
Copy link
Contributor

QoR results are at https://www.eecg.utoronto.ca/~vaughn/vaughnwiki/doku.php?id=amin_weekly_update and look good. On both 3D SW and 2D Titan architectures, pack, place and route time and QoR are essentially unchanged, and on a 3D arch total runtime drops by 2%, which makes sense as the loading of the cost structures should be faster (and may be counted outside place time). @amin1377 : if you can paste the QoR data in for posterity that would be good.

Copy link
Contributor

@vaughnbetz vaughnbetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some commenting updates, and one functional change to divide by layer-1.

vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.h Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.h Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.h Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
const auto& device_ctx = g_vpr_ctx.device();
const auto& rr_graph = device_ctx.rr_graph;

const size_t grid_height = device_ctx.grid.height();
const size_t grid_width = device_ctx.grid.width();


chanz_place_cost_fac_ = vtr::NdMatrix<float, 4>({grid_width, grid_height, grid_width, grid_height}, 0.);
acc_tile_num_inter_die_conn_ = vtr::NdMatrix<int, 2>({grid_width, grid_height}, 0.);

vtr::NdMatrix<float, 2> tile_num_inter_die_conn({grid_width, grid_height}, 0.);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does tile_num_inter_die_conn need to store float values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a count, so int is probably better. Float could have some round-off issues with big chips, as this matrix is filled in by adding small numbers to a running count, which can get problematic if the big number ever becomes more than 16 million times or so bigger than the small number (the small number then gets thrown away by round off).

vpr/src/place/net_cost_handler.cpp Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Show resolved Hide resolved
acc_tile_num_inter_die_conn_[x_low-1][y_high] - \
acc_tile_num_inter_die_conn_[x_high][y_low-1] + \
acc_tile_num_inter_die_conn_[x_low-1][y_low-1];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using vtr::NdOffsetMatrix will be consistent with thest of file and will get rid of this if statement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soheilshahrouz, if you remember, my initial attempt was to use this data structure, but I encountered some memory issues with it. I’ll create an issue for this and, in another PR, attempt to replace this data structure with vtr::NdOffsetMatrix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I figured out what the problem was.
Check if 68ecb55 sovles the memory issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll try it once the PR is merged.

vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.h Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
vpr/src/place/net_cost_handler.cpp Outdated Show resolved Hide resolved
@vaughnbetz
Copy link
Contributor

This is separate from this PR, but it's possible that using the same technique in the x- and y- directions that you're using for the z-direction to calculate the average channel width by adding and subtracting the proper values would be faster for x- and y- too.

It saves memory (would take storage for x and y average channel capacities down from O(device size) to O(sqrt(device size)), which would help the cache and memory bandwidth. If would mean two matrix accesses instead of 1, and an extra reciprocal and pow() call, but it might still be overall just as fast or faster. After this PR is landed, if you have th energy for it you could do a quick experiment to check by recoding the x- and y- dimensions to match the z-dimension approach.

@soheilshahrouz
Copy link
Contributor

This is separate from this PR, but it's possible that using the same technique in the x- and y- directions that you're using for the z-direction to calculate the average channel width by adding and subtracting the proper values would be faster for x- and y- too.

@amin1377 If you're busy with other tasks, I am interested in working on this idea for chan_x and chan_y. Let me know if I can take it over.

@amin1377
Copy link
Contributor Author

amin1377 commented Nov 5, 2024

This is separate from this PR, but it's possible that using the same technique in the x- and y- directions that you're using for the z-direction to calculate the average channel width by adding and subtracting the proper values would be faster for x- and y- too.

@amin1377 If you're busy with other tasks, I am interested in working on this idea for chan_x and chan_y. Let me know if I can take it over.

How could I possibly say no to such a generous offer, Soheil? Thanks a lot! :) Let me know if I can be of any help.

@amin1377
Copy link
Contributor Author

amin1377 commented Nov 5, 2024

QoR:
titan other 3d CB cube bb
3d_titan_other_full_opin_cube_bb
titan_quick_qor
titan_quick_qor

@vaughnbetz
Copy link
Contributor

QoR looks good.

@amin1377
Copy link
Contributor Author

amin1377 commented Nov 6, 2024

@vaughnbetz, I’ve addressed both your comments and Soheil’s. After making the changes, I reran the benchmarks to ensure everything is still working correctly, and the tests passed without issues. I think the PR is ready to be merged.

@vaughnbetz vaughnbetz merged commit e179d88 into master Nov 6, 2024
37 checks passed
@vaughnbetz vaughnbetz deleted the chan_z_prefix_sum branch November 6, 2024 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants