Overall package vision #9

Open
eroten opened this issue Jul 1, 2020 · 1 comment
eroten commented Jul 1, 2020

Access and clean raw data

The fundamental purpose of this package is to access loop detector data from MnDOT's JSON feed. The data can be somewhat "dirty", and the package will include functions for finding nulls and interpolating values, flagging impossible values, and formatting column names and classes.
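
A minimal sketch of what the cleaning step might look like, assuming a {data.table} of 30-second observations. The column names (`sensor`, `volume`, `occupancy`), thresholds, and function name are placeholders for illustration, not the package's actual interface:

```r
library(data.table)

# Hypothetical cleaning step: coerce classes, flag impossible values as NA,
# and interpolate short gaps. Names and thresholds are illustrative only.
clean_sensor_data <- function(raw, max_gap = 10) {
  dt <- as.data.table(raw)
  # JSON feeds often arrive as character; coerce to numeric (bad values -> NA)
  dt[, `:=`(volume = as.numeric(volume), occupancy = as.numeric(occupancy))]
  # Flag impossible values: negative counts, or occupancy above the
  # 1800 scans possible in a 30-second period (60 scans per second)
  dt[volume < 0, volume := NA_real_]
  dt[occupancy < 0 | occupancy > 1800, occupancy := NA_real_]
  # Linearly interpolate runs of up to `max_gap` missing values, per sensor
  dt[, `:=`(
    volume = zoo::na.approx(volume, na.rm = FALSE, maxgap = max_gap),
    occupancy = zoo::na.approx(occupancy, na.rm = FALSE, maxgap = max_gap)
  ), by = sensor]
  dt[]
}
```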

Data storage

There are pros and cons to putting the cleaned sensor data into a database.

  • By putting the data in a database, we can flexibly specify the time period we're interested in (at the moment, three years' worth of data), the time interval that is relevant (we currently use daily data), and the geographic resolution that we need (nodes comprised of multiple sensors).
  • Sometimes we will want 15-minute or hourly data; sometimes we will want data going back a decade (for the congestion report); and sometimes we will want data aggregated to the level of corridors, split down to individual lanes of a corridor, or for individual sensors...and any combination of those three kinds of resolution (historic scope, temporal resolution, spatial resolution).
  • Once the data is in our database, outside folks can't really use it. Keeping the data in an internal-only database defeats the open-source idea behind the package, which is to let people see how we calculate the various measures and QA/QC the data.
  • We also need to consider the cost of physically storing the data.

Aggregate

The raw data is provided in 30-second intervals. Common temporal aggregations include 10, 15, and 30 minutes, 1 hour, morning and evening peak periods, and 24 hours.
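
As a sketch of the temporal side, assuming hypothetical columns `sensor`, `date`, `sec_of_day` (seconds after midnight), `volume`, and `occupancy`, the 30-second rows could be rolled up to an arbitrary interval with {data.table}:

```r
library(data.table)

# Roll 30-second observations up to `interval_min`-minute bins.
# Volumes sum across the bin; occupancy averages. Column names are assumptions.
aggregate_interval <- function(dt, interval_min = 15) {
  # Bin start, in minutes after midnight
  dt[, bin_start := (sec_of_day %/% (interval_min * 60)) * interval_min]
  dt[, .(volume    = sum(volume, na.rm = TRUE),
         occupancy = mean(occupancy, na.rm = TRUE)),
     by = .(sensor, date, bin_start)]
}
```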

The raw data is accessed for an individual sensor. Sensors can be aggregated up to nodes/stations, corridors, and possibly lanes. We also need functionality for aggregating nodes, stations, and corridors up to polylines.
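
And for the spatial side, a hypothetical roll-up from sensors to nodes, assuming a configuration lookup table that maps each `sensor` to its `r_node_name` (the node identifier in MnDOT's configuration file); the function and column names are again illustrative:

```r
# Join each sensor's interval-level data to its node, then sum volumes
# across all sensors that make up the node.
aggregate_to_node <- function(dt, config) {
  merged <- merge(dt, config[, .(sensor, r_node_name)], by = "sensor")
  merged[, .(volume = sum(volume, na.rm = TRUE)),
         by = .(r_node_name, date, bin_start)]
}
```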

Calculate

Aggregated data can be used to calculate various measures.

  • Flow: The number of vehicles that pass through a detector per hour
  • Headway: The number of seconds between each vehicle
  • Density: The number of vehicles per mile
  • Speed: The average speed of the vehicles that pass in a sampling period
    • UPDATE: Speed is calculated as part of aggregate_sensor_data()
  • Lost/Spare Capacity: The average flow that a roadway is losing, either due to low traffic or high congestion, throughout the sampling period (see the sketch after this list)
    • Flow > 1800: 0
    • Density < 43: Lost Capacity = Flow - 1800
    • Density >= 43: Lost Capacity = 1800 - Flow
  • Vehicle Miles Traveled (VMT)
  • Others(?)
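
A sketch of these calculations using the standard loop-detector relationships. It assumes `occupancy` is raw scan counts out of 1800 per 30-second period, that `field_length` (average vehicle length in feet) comes from the sensor configuration, and that 43 veh/mi marks critical density; it is not the package's actual aggregate_sensor_data() implementation:

```r
library(data.table)

# Standard loop-detector measures from 30-second volume and occupancy scans.
# Names, units, and thresholds here are assumptions for illustration.
calculate_measures <- function(dt) {
  dt[, occ_pct := occupancy / 1800]              # share of the 30 sec occupied
  dt[, flow    := volume * 120]                  # veh/hr (120 periods per hour)
  dt[, headway := 3600 / flow]                   # avg seconds between vehicles
  dt[, density := occ_pct * 5280 / field_length] # veh/mi (5280 ft per mile)
  dt[, speed   := flow / density]                # mi/hr, since flow = density * speed
  # Lost/spare capacity per the rules above, with 1800 veh/hr as capacity:
  # negative values are spare capacity (low traffic), positive are lost (congestion)
  dt[, lost_capacity := fifelse(
        flow > 1800, 0,
        fifelse(density < 43, flow - 1800, 1800 - flow))]
  dt[]
}
```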

General practices

  • The loop detector data is very large, particularly when working with multiple detectors and days. Generally, rely on {data.table} rather than {dplyr}, {tidyr}, and other packages.
    • If you are making the transition from {dplyr} to {data.table}, use {dtplyr} to "translate" between the two. However, don't forget to remove all {dtplyr} functions before pushing (see the example after this list).
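
For example, {dtplyr}'s show_query() prints the {data.table} translation of a {dplyr} pipeline, which can then be pasted in directly so the {dtplyr} dependency can be dropped before pushing. A small illustration with toy data, not package code:

```r
library(dplyr)
library(dtplyr)

# Prototype with dplyr verbs against a lazy data.table...
ld <- lazy_dt(data.frame(sensor = c("a", "a", "b"), volume = c(3, 5, 2)))

ld %>%
  group_by(sensor) %>%
  summarise(volume = sum(volume)) %>%
  show_query()
# ...then copy the printed {data.table} code, which is roughly:
# `_DT1`[, .(volume = sum(volume)), keyby = .(sensor)]
```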

ashleyasmus commented Sep 29, 2020

I just got off the phone with Tim Johnson (MNIT), a software developer working with the MnDOT loop detector data. It was a really nice call - he's super kind and easy to talk to, and totally on board with what we want to do with the data.
I called him because we had been talking about the server issues, and he said we should chat if I had thoughts about things they could do on the server side (aggregations and transformations of the data) to make our work easier.

He said several things that were really promising. One was that the work we are doing with the traffic data (downloading, aggregating, transforming, and loading it into our own database) is also being duplicated by other groups (academia, government), and that his team sees this work as more ideally performed closer to the server side to keep things standardized. I mentioned issues with how we identify data/sensors as trustworthy, and with tracking changes to the "field_length" (vehicle length) attribute over time. He completely agreed and said that they were having similar discussions in his own group.

Another thing is that his team is all about open-source software development. Currently the traffic data server is written in a language called Rust -- I'm not familiar with it at all, and he said not to worry: we could submit issues or ideas for things we might want built on the server side, and they could do it.

He also talked about how there are internal discussions about whether they should start to store the data in a formal database, especially if we were going to create derived fields. I said that ideally they would have a database I could query for just the data I needed, at the spatial and temporal aggregation that made sense for me. He agreed that this work needed to be done and said he'd like to involve us more in brainstorming what exactly that would look like.

eroten unpinned this issue Oct 27, 2022