"My Documents" data fetch needs overhaul #91

Open
jennifer-richards opened this issue Jan 3, 2024 · 2 comments
@jennifer-richards
Member

The "My Documents" page loads data to display by retrieving all the RfcToBe data and all assignments from the backend, then filters these on the client side. This is extremely slow (takes ~11 seconds on my dev machine with the migrated data set) and reveals data that should perhaps not be visible to every client.
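The fix direction implied here can be sketched in plain Python. This is only an illustration, not the project's code: the record shape and the `person_id` field are hypothetical stand-ins for the real assignment schema.

```python
# Hypothetical sketch: run the "my documents" filter on the server
# instead of the client. Field names here are illustrative only.

def my_assignments(all_assignments, person_id):
    """Return only the assignments belonging to one RPC person.

    Today the client fetches *all* assignments and applies this
    predicate itself; applying the same predicate in the backend query
    means only the relevant rows get serialized and sent over the wire,
    and rows for other people are never exposed to the client.
    """
    return [a for a in all_assignments if a["person_id"] == person_id]

assignments = [
    {"id": 1, "person_id": 7, "rfc_to_be": 101},
    {"id": 2, "person_id": 9, "rfc_to_be": 102},
    {"id": 3, "person_id": 7, "rfc_to_be": 103},
]
mine = my_assignments(assignments, person_id=7)  # only ids 1 and 3
```

In the real backend this predicate would presumably become a `WHERE` clause (or ORM filter) on the assignments endpoint rather than a Python list comprehension.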

A substantial chunk of this time (around 75-80%) is spent computing history that is not actually shown on the view. (This is a bit worrying because there's essentially no history in the migrated data, but it's not time to panic about it.)

@microamp
Contributor

microamp commented Sep 13, 2024

I can reproduce the behaviour locally. When I refresh the "My Documents" page, the client makes the following API calls.

All calls took less than a second, except the documents one, which took 15+ seconds.

I have also gathered database logs for the page load, stored in a file called db.log.

head -5 db.log
2024-09-12 22:47:36.258 UTC [156] LOG:  statement: SELECT set_config('TimeZone', 'UTC', false)
2024-09-12 22:47:36.260 UTC [156] LOG:  statement: SELECT "django_session"."session_key", "django_session"."session_data", "django_session"."expire_date" FROM "django_session" WHERE ("django_session"."expire_date" > '2024-09-12T22:47:36.254995+00:00'::timestamptz AND "django_session"."session_key" = 'hl4gjb6qbn3enolzowalxyixce22h8bc') LIMIT 21
2024-09-12 22:47:36.263 UTC [156] LOG:  statement: SELECT "rpcauth_user"."id", "rpcauth_user"."password", "rpcauth_user"."last_login", "rpcauth_user"."is_superuser", "rpcauth_user"."username", "rpcauth_user"."first_name", "rpcauth_user"."last_name", "rpcauth_user"."email", "rpcauth_user"."is_staff", "rpcauth_user"."is_active", "rpcauth_user"."date_joined", "rpcauth_user"."name", "rpcauth_user"."datatracker_subject_id", "rpcauth_user"."avatar" FROM "rpcauth_user" WHERE "rpcauth_user"."id" = 1 LIMIT 21
2024-09-12 22:47:36.264 UTC [157] LOG:  statement: SELECT set_config('TimeZone', 'UTC', false)
2024-09-12 22:47:36.264 UTC [156] LOG:  statement: SELECT "datatracker_datatrackerperson"."datatracker_id" FROM "rpc_rpcperson" INNER JOIN "datatracker_datatrackerperson" ON ("rpc_rpcperson"."datatracker_person_id" = "datatracker_datatrackerperson"."id")
tail -5 db.log
2024-09-12 22:47:53.243 UTC [158] LOG:  statement: SELECT "datatracker_datatrackerperson"."id", "datatracker_datatrackerperson"."datatracker_id" FROM "datatracker_datatrackerperson" WHERE "datatracker_datatrackerperson"."datatracker_id" = 103961 ORDER BY "datatracker_datatrackerperson"."id" ASC LIMIT 1
2024-09-12 22:47:53.243 UTC [158] LOG:  statement: SELECT "rpc_rpcperson"."id", "rpc_rpcperson"."datatracker_person_id", "rpc_rpcperson"."hours_per_week", "rpc_rpcperson"."manager_id" FROM "rpc_rpcperson" WHERE "rpc_rpcperson"."datatracker_person_id" IS NULL ORDER BY "rpc_rpcperson"."id" ASC LIMIT 1
2024-09-12 22:47:53.267 UTC [158] LOG:  statement: SELECT "datatracker_datatrackerperson"."id", "datatracker_datatrackerperson"."datatracker_id" FROM "datatracker_datatrackerperson" WHERE "datatracker_datatrackerperson"."datatracker_id" = 103961 ORDER BY "datatracker_datatrackerperson"."id" ASC LIMIT 1
2024-09-12 22:47:53.289 UTC [158] LOG:  statement: SELECT "datatracker_datatrackerperson"."id", "datatracker_datatrackerperson"."datatracker_id" FROM "datatracker_datatrackerperson" WHERE "datatracker_datatrackerperson"."datatracker_id" = 103961 ORDER BY "datatracker_datatrackerperson"."id" ASC LIMIT 1
2024-09-12 22:47:53.289 UTC [158] LOG:  statement: SELECT "rpc_rpcperson"."id", "rpc_rpcperson"."datatracker_person_id", "rpc_rpcperson"."hours_per_week", "rpc_rpcperson"."manager_id" FROM "rpc_rpcperson" WHERE "rpc_rpcperson"."datatracker_person_id" IS NULL ORDER BY "rpc_rpcperson"."id" ASC LIMIT 1

The first and last queries are about 17 seconds apart.

The page load is issuing a very large number of database queries:

grep "statement: SELECT" db.log | wc -l
33187
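A query count this high for a single page load is the signature of an N+1 pattern: one query to fetch the documents, then one (or more) follow-up query per document. A small self-contained sketch of the arithmetic, with a fake query log standing in for the database:

```python
# Illustrative sketch (not the project's code): why per-object lookups
# explode into tens of thousands of statements. With N documents, one
# query per document costs 1 + N round trips; a single batched
# IN (...) query costs 2 regardless of N.

query_log = []

def fetch_documents():
    query_log.append("SELECT ... FROM rpc_rfctobe")
    return list(range(1000))  # pretend we got 1000 document ids

def fetch_history_per_doc(doc_id):
    query_log.append(f"SELECT ... WHERE id = {doc_id}")

def fetch_history_batched(doc_ids):
    query_log.append("SELECT ... WHERE id IN (...)")

# N+1 pattern: one statement per document
docs = fetch_documents()
for d in docs:
    fetch_history_per_doc(d)
n_plus_one = len(query_log)   # 1 + 1000 = 1001 statements

# Batched pattern: two statements total
query_log.clear()
docs = fetch_documents()
fetch_history_batched(docs)
batched = len(query_log)      # 2 statements
```

In Django terms the usual remedies are `select_related`/`prefetch_related` on the queryset, or dropping the per-object data from the serializer entirely when the view doesn't display it.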

About 30% of those queries hit the rpc_historical* tables:

grep "FROM \"rpc_historical" db.log | wc -l
9568

At a quick glance, they mostly look like the following:

SELECT "rpc_historicalrfctobe"."id",
       "rpc_historicalrfctobe"."is_april_first_rfc",
       "rpc_historicalrfctobe"."rfc_number",
       "rpc_historicalrfctobe"."external_deadline",
       "rpc_historicalrfctobe"."internal_goal",
       "rpc_historicalrfctobe"."disposition_id",
       "rpc_historicalrfctobe"."draft_id",
       "rpc_historicalrfctobe"."submitted_format_id",
       "rpc_historicalrfctobe"."submitted_std_level_id",
       "rpc_historicalrfctobe"."submitted_boilerplate_id",
       "rpc_historicalrfctobe"."submitted_stream_id",
       "rpc_historicalrfctobe"."intended_std_level_id",
       "rpc_historicalrfctobe"."intended_boilerplate_id",
       "rpc_historicalrfctobe"."intended_stream_id",
       "rpc_historicalrfctobe"."history_id",
       "rpc_historicalrfctobe"."history_date",
       "rpc_historicalrfctobe"."history_change_reason",
       "rpc_historicalrfctobe"."history_type",
       "rpc_historicalrfctobe"."history_user_id"
FROM "rpc_historicalrfctobe"
WHERE "rpc_historicalrfctobe"."id" = 1
ORDER BY "rpc_historicalrfctobe"."history_date" DESC, "rpc_historicalrfctobe"."history_id" DESC

> (This is a bit worrying because there's essentially no history in the migrated data, but it's not time to panic about it.)

My understanding is that when a document is first inserted into the database, it generates a history entry with history_type '+' in the rpc_historicalrfctobe table. So, there will be at least one row of history for each document initially.

-- simplified version of the query above
SELECT "rpc_historicalrfctobe"."id",
       "rpc_historicalrfctobe"."history_id",
       "rpc_historicalrfctobe"."history_date",
       "rpc_historicalrfctobe"."history_type"
FROM "rpc_historicalrfctobe"
WHERE "rpc_historicalrfctobe"."id" = 1
 id | history_id |         history_date          | history_type
----+------------+-------------------------------+--------------
  1 |          1 | 2024-09-11 21:35:36.814197+00 | +
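Given at least one history row per document, the thousands of per-document `ORDER BY` queries above could in principle collapse into a single fetch grouped in memory. A hedged sketch, with rows modelled as `(id, history_id, history_date, history_type)` tuples matching the simplified query; the real code would use the ORM's equivalent of `WHERE id IN (...)`:

```python
# Sketch only: fetch every relevant rpc_historicalrfctobe row once,
# then group by document id in memory instead of issuing one
# ORDER BY query per document. Sample rows are made up.
from collections import defaultdict

rows = [
    (1, 1, "2024-09-11 21:35:36+00", "+"),
    (2, 2, "2024-09-11 21:35:37+00", "+"),
    (1, 3, "2024-09-12 09:00:00+00", "~"),
]

history_by_doc = defaultdict(list)
for doc_id, history_id, history_date, history_type in rows:
    history_by_doc[doc_id].append((history_id, history_date, history_type))

# Newest-first per document, mirroring ORDER BY history_date DESC
for entries in history_by_doc.values():
    entries.sort(key=lambda e: e[1], reverse=True)
```

Of course, if the view never displays the history at all, the cheaper fix is to not fetch it in the first place.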

@jennifer-richards
Member Author

> My understanding is that when a document is first inserted into the database, it generates a history entry with + in the rpc_historicalrfctobe table. So, there will be at least one row of history for each document initially.

Sorry, I didn't mean it literally. What I meant to convey was that the document histories are very short right now compared to what they'll be once the system is deployed and documents are actually being processed. The "worrying" part is that, even with minimal history lengths, it's already a near-dominant factor in the processing time.

Thanks for doing the detailed investigation. I'm hopeful that the performance we're seeing now will improve enormously when we go from a quick-and-dirty "just get it going" implementation to something more mature.
