Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching geography codes with the area names #18

Merged
merged 88 commits into from
Apr 15, 2020
Merged

Conversation

jvillacampa
Copy link
Contributor

@jvillacampa jvillacampa commented Jan 24, 2020

As in issue #13. Function that matches a column containing geography codes with their respective area names. I don't think I have done a very good job with the tests and covering potential problems with user's inputs, so any feedback very welcomed.

fixes #13.

@jvillacampa jvillacampa added the enhancement New feature or request label Jan 24, 2020
@jvillacampa jvillacampa self-assigned this Jan 24, 2020
@jackhannah95
Copy link
Contributor

Hi @jvillacampa. Just to say that @davidc92 and @lucindalawrie are going to do a review of this and get back to you soon.

Copy link
Contributor

@davidc92 davidc92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jaime!

Have a few initial comments. I'll take another look once you've made these changes and then maybe start looking at the documentation and tests.

  • Can you add some error handling at the start of the function? Checks that whatever is input is the correct object e.g. is dataset a tibble? If not, throw out an error. Might also be worth running a check on the format of the (presumably) string from code_var. The geography codes are all a standard format so I don't think it would be too difficult to check that the string matches regular expressions for these standard formats.
  • You need to add the package names before function calls. That way the user doesn't have to have the packages loaded and also mitigates the risk of one packaging masking a function from another. e.g. you would type janitor::clean_names().
  • Can you be picky over spacing? I.e. add a space after a hash for comments and add blank lines before comments and try and break your code up with blank lines so that it isn't all clumped together? Can you also be stricter with the 80 character limit per line?

Just to note - I looked at the travis output for the checks and it looks like the checks are failing because the unit tests aren't passing. Might be worth a look.

R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
@jvillacampa
Copy link
Contributor Author

Cheers David, I will take a look to your comments in the next few days

Copy link
Contributor

@davidc92 davidc92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jaime! Just a couple of small comments here. I tried to test the package on some of my own data, but it errored because you are using the %<>% operator without importing it. You can import it by editing the R/phsmethods.R script and adding @importFrom magrittr %<>% below the other import line. Otherwise the user needs to have magrittr loaded for the function to work. Can you also make sure any package imports you add to the DESCRIPTION file are in alphabetical order?

I'll take a closer look at the code this afternoon :)

DESCRIPTION Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
R/match_geo_names.R Outdated Show resolved Hide resolved
@jvillacampa
Copy link
Contributor Author

Hi @davidc92, sorry I pushed some changes, but I still need to deal with some of your comments from before! I'll let you know when I have got everything ready. Cheers.

@jvillacampa
Copy link
Contributor Author

Hi @davidc92, thank you for all the comments. I think I have dealt with all of them now and is ready for a new round of review.
A couple of points:

  • The code extracting the info from the API is in the function at the moment, but it won't run as the reference file exists. Not sure how to ensure this file is kept up to date in a low-maintenance way.

  • The devtools::check fails as during the tests (test_that) it fails as it can't open the reference files I use. Slightly confused about why is this. Could this be solved using the package here?

@davidc92
Copy link
Contributor

davidc92 commented Feb 6, 2020

Hi @davidc92, thank you for all the comments. I think I have dealt with all of them now and is ready for a new round of review.
A couple of points:

  • The code extracting the info from the API is in the function at the moment, but it won't run as the reference file exists. Not sure how to ensure this file is kept up to date in a low-maintenance way.
  • The devtools::check fails as during the tests (test_that) it fails as it can't open the reference files I use. Slightly confused about why is this. Could this be solved using the package here?

Adding here() might actually solve some of the tests failing - give that a go and see if it works. Also, you'll have to update one of the unit tests for file_size(). Line 8 is testing that the number of files in the tests directory is equal to 8, but you have also added a test file so the 8 should become 9 :)

@jvillacampa
Copy link
Contributor Author

Using "here" doesn't help in this case. Apparently this is a well known issue. I have tried the proposed workaround but still fails. I have done a workaround replicating one of the files and taking out one of the tests that I don't think added much.

However, after doing these changes still fails. The problem is that when the function runs it doesn't recognize that the lookup in the reference files folder exists and then it tries to connect to the open data platform, which could be fine, but for some reason it doesn't find the function getURL from RCurl (which is a dependency of SPARQL). I have tried a couple of things, but a bit stuck here.

The tests run perfectly fine manually, fail as described above for devtools::checks(), and for devtools::test() they connect to the open data platform but fail when it comes to the line where it saves the data in the reference file folder.

I have pushed the latest version of what I was trying, but any help/advice welcome.

@jackhannah95
Copy link
Contributor

Hi @jvillacampa,

Sorry, I have gone overboard with this. The more I thought about it, the more I thought that requiring a data frame to be provided to the function probably meant that it wasn't any less effort than just doing your own join to a lookup file. Some of the commits relate to small changes to the other functions at least.

I think the ease with a function like this would come with being able to just input directly anything from a single code to a column with a million of them and have it return the names in the same format. Something like:

df %>%
  mutate(area_name = match_area(geo_code))

I also thought it should be able to do the reverse, since both are available in the lookup. Please let me know if you're not happy with anything or have any improvements to suggest. Or if it's all terrible and you preferred the original way.

I took the same approach to saving, storing and calling data as the janeaustenr package. It seems to be working fine, and has the handy benefit that R >= 3.5 isn't required and no duplicates need to be created for unit tests. Again though, let me know if you think anything could be done better.

The SPARQL code extracts codes that have no accompanying area name, but I had to drop them as otherwise entering an NA name returned all those codes. In practice dropping them doesn't make a difference to the function as you intended it, because entering any code that isn't in the lookup dataset returns an NA anyway. It'd be more efficient to edit the SPARQL query to only extract codes with an accompanying area name, rather than bringing them all in and dropping all NA values. I tried editing the query and it was a disaster. Any help you're able to provide with that would be much appreciated.

Cheers,

Jack

@jvillacampa
Copy link
Contributor Author

Hi @jackhannah95, thanks for doing that.

Linking name to code - I thought about it, but the main problem is that names are not unequivocal. I really don't think the way the function works would provide useful output. Or at least I don't think I would use to get codes from names as it would require me doing quite a bit of extra work for those cases with multiple codes, which are pretty common. I started taking all this functionality out the code, but I thought I better check it with you first.

I like what you have done so it takes a vector/variable instead of a data frame. Much neater.

I have a bunch of small edits that I will be doing, but I'll wait to hear from you before pushing stuff.
Thanks,
Jaime

@jvillacampa
Copy link
Contributor Author

Just to say, that I think the feature of names to codes would be very useful, but it would need some more control, so users can decide the area type of their input (e.g. HB and they get S08 codes), and we will need a more complete lookup so more combinations of names are included (e.g. NHS Highland and not only Highland or Dumfries and Galloway and Dumfries & Galloway)

@jackhannah95
Copy link
Contributor

jackhannah95 commented Apr 9, 2020

Hi @jvillacampa,

You know better than I do what'd be useful functionality in practice so yeah by all means make any changes you want to. I'm not precious about stuff being deleted.

@jvillacampa
Copy link
Contributor Author

@jackhannah95 I uploaded some changes, all cosmetic apart from taking the "code" functionality. I think that would be very useful to have, but maybe for a future release I can figure out based on what you started. I tried to figure out how to drop the NAs from the SPARQL query but I failed, but your solution works well so not sure we need to look at it more. Thanks a lot for the work you have put into this, it's much better now!

@jackhannah95
Copy link
Contributor

Looks good @jvillacampa 👍 there are a few small things to do like delete some old tests and re-word the odd bit of documentation but nothing major.

I guess the only bits I would still quibble over are some of the names. code_var is more descriptive than x, granted, but I don't think many people will ever actually type out the variable name. We've called the first argument x in every other function (apart from file_size() but it takes a filepath rather than a value) and for the sake of consistency I'd prefer it to be the same in this one too, if that's okay.

Similarly, within the function itself, no_9char_codes is definitely more descriptive than n, but it's invisible to the user and I just wanted a concise way of getting that value into the warning message. I don't think it matters too much that it doesn't really describe what that value is (I think the comment is probably descriptive enough) so ideally I'd like to shorten that back to n too if that'd be alright.

Chances are I'll have to do something at the end to make travis build anyway so I'm happy to do those bits (it'll be one or two commits max) and give you another look before this is finally merged. Would that be alright?

@jvillacampa
Copy link
Contributor Author

jvillacampa commented Apr 9, 2020 via email

@jackhannah95
Copy link
Contributor

Haha thanks! Sorry for bugging you. If you ever become a maintainer in the future you're welcome to change all the argument names. But then I guess you'd have to explain to everyone why their code doesn't work anymore 🤷

Hopefully these last two commits take care of everything and we can finally merge this.

Copy link
Contributor

@jackhannah95 jackhannah95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's time

@jvillacampa
Copy link
Contributor Author

I think now @davidc92 needs to approve before we can merge

@jackhannah95 jackhannah95 merged commit fe12c7a into master Apr 15, 2020
@jackhannah95 jackhannah95 deleted the geo_names branch April 15, 2020 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Matching geography codes to their names
3 participants