Welcome! The vast amount of data produced by evolving information technology requires tools and skills to manage and analyze it. Among the many tools available, R is a free, open-source language and environment that can be used for data science. This course covers topics in R and data science, with applications illustrated via RStudio. This course is part of the University of Toronto's Data Sciences Institute Professional Programming.
- Description
- Learning Outcomes
- Course Contacts
- Delivery Instructions
- Course Notes
- Materials
- Schedule
- Marking Scheme
- Course Policies
The first part of this course teaches R with a focus on manipulating and visualizing data. Learners will get set up with a functional RStudio workflow, use different file types, transform data tables, import and manipulate data, use functions and loops, create data visualizations, make a Shiny app, and learn how to solve problems with their programming. Both base R and tidyverse methods are taught. To work reproducibly, learners will create R Projects. The second part of the course will cover the ethics of consent, Equity, Diversity & Inclusion (EDI) training, and professional skills including presentation, project management, and data security. Finally, the course will conclude with an industry case study. This course is designed for learners who have a degree in something other than Computer Science/Statistics who are looking to enhance their data science skills for their career.
Learners will know how to:
- Comfortably access R, identify options for working with R, lay out the purpose of using RStudio and R Projects, understand best R coding practices, and recognize where R stands among other data science tools. Further, learners will be able to navigate RStudio to write scripts, use different R data types and structures, use built-in R commands, and access external functions by installing R packages. This will be assessed in Assessment 1.
- Describe and define features of a dataset by applying manipulation and wrangling techniques. Learners will be able to access built-in R datasets and import external datasets into R to identify and describe data structures, apply manipulation techniques to reshape the datasets, detect missing values, clean data, summarize data, export data and report findings. This will be assessed in Assessment 2.
- Explain the strengths and limitations of R workflows and analyses using concepts of reproducibility, bias, diversity, inclusion, ethical considerations, equity concepts, data security and best coding practices. This will be assessed in Assessment 2.
- Build a strategy for exploring data by designing functions that can take data as input, perform simple analyses, and generate exploratory plots as appropriate for data type and story to be told. This will be assessed in Assessment 2.
- The instructor for this course is Anjali Silva, PhD (she/her). To email the instructor, use a.silva@utoronto.ca. You must use the subject line DSI-IntroR, e.g., DSI-IntroR: Inquiry about Lecture I. Response times: weekdays, 48 h; weekends, 48-72 h.
- The Teaching Assistant (TA) for this course is Jessie Wang, PhD student (she/her). To email the TA, use jae.wang@mail.utoronto.ca.
- The course will be held over a period of 3 weeks, with classes taking place 3 days a week. The format will be online and synchronous via Zoom (Meeting ID: 248 642 5344; for the passcode, see the email with subject 'Data Sciences Institute, UofT – Welcome & Pre-Class info'). All course material will be available via the IntroductionToR GitHub repository. If you experience issues joining the live lectures, you must email the TA and copy the instructor. Include a description of the issue, the time, and the date (with a screenshot if available) to avoid loss of participation marks. If a live (synchronous) lecture is disrupted or cannot be held due to unavoidable circumstances, the instructor will upload the recording and announce it by email. It is the responsibility of the learners to view the recording.
- All course material will be available via the IntroductionToR GitHub repository. The folder structure is as follows:
  - Assessments: This folder contains assessment files for learners.
  - Lessons-AllFiles: This folder contains all files (R Markdown, slide HTML, slide PDFs, images, data, etc.) and is designed for the instructor.
  - Lessons-Data: This folder contains data only and is designed for the learners. Learners should download this folder and copy it as the 'data' folder within their R Project.
  - Lessons-PDF: This folder contains slide PDFs only and is designed for the learners. Learners should download the slides. Slides should be referenced before class to prepare or after class to review. During class, there will be mostly live-coding. The end of each slide deck will contain homework for that particular lesson. It is highly recommended that learners attempt these and attend tutorial sessions to seek help.
  - Lessons-Rscripts: This folder contains R scripts used by the instructor. It will be updated after each class and learners may download it for reference.
  - Teaching-Notes: This folder contains lesson plans only and is designed to guide the instructor.
  - README: README file.
  - .gitignore: List of files to ignore, as specified by the instructor.
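The Lessons-Data step described above can also be done from the R console. The following is a minimal sketch and not part of the course material: the helper name `copy_lessons_data()` and both folder paths are hypothetical, and temporary directories stand in for the downloaded folder and the learner's R Project so that the example runs anywhere.

```r
# Hypothetical helper: copy everything in a downloaded Lessons-Data folder
# into a 'data' folder inside an R Project directory.
copy_lessons_data <- function(src_dir, project_dir) {
  data_dir <- file.path(project_dir, "data")
  # Create <project>/data if it does not exist yet
  dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
  # Copy every file and subfolder found in the downloaded folder
  files <- list.files(src_dir, full.names = TRUE)
  file.copy(files, data_dir, recursive = TRUE)
  invisible(data_dir)
}

# Demo with temporary stand-ins for the real folders (assumed paths)
src_dir <- file.path(tempdir(), "Lessons-Data")
project_dir <- file.path(tempdir(), "MyRProject")
dir.create(src_dir, showWarnings = FALSE)
write.csv(head(mtcars), file.path(src_dir, "example.csv"))

data_dir <- copy_lessons_data(src_dir, project_dir)
list.files(data_dir)
```

In practice, `src_dir` would point to wherever the downloaded Lessons-Data folder was saved and `project_dir` to the R Project folder.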
- Learners must have an internet connection and a computer with administrative privileges, a microphone, and all required software installed in order to participate in online activities.
- Learners must have R (http://www.r-project.org/). We will help with downloading.
- Learners must have RStudio (Previously: http://www.rstudio.com/; now: https://posit.co/download/rstudio-desktop/). We will help with downloading.
- GitHub account (https://github.com/).
- Screen space can be a limitation during online learning since you'll want to see the instructor's screen and have your RStudio open so that you can type along. If you have access to a second monitor or a larger tablet to attend the course while keeping your laptop screen available for coding - this would be great! If not - don't worry, we'll manage!
- Key texts: General reference
- Wickham and Grolemund, 2017, R for Data Science, O'Reilly. https://r4ds.had.co.nz/
- Alexander (eds), 2021, DoSS Toolkit, https://rohanalexander.github.io/doss_toolkit_book/
- Key texts: For specific topics
- Alexander, 2022, Telling Stories with Data, CRC Press. https://www.tellingstorieswithdata.com/
- de Graaf, 2019. Managing Your Data Science Projects: Learn Salesmanship, Presentation, and Maintenance of Completed Models, Apress.
- Healy, 2018. Data Visualization: A Practical Introduction, Princeton University Press
- Timbers et al., 2021. Data Science: A First Introduction. https://ubc-dsci.github.io/introduction-to-datascience/
- Wickham, 2021. Mastering Shiny, O'Reilly. https://mastering-shiny.org/
- Wiley and Wiley, 2020. Advanced R 4 Data Programming and the Cloud: Using PostgreSQL, AWS, and Shiny, Apress.
The schedule may be modified as needed, and learners will be informed. The course will be taught using R version 4.2.1 and RStudio Desktop version 2022.02.3. All times are in Eastern Standard Time (EST). Tutorials will be led by the TA. Use tutorials to clarify assessment questions or to solve homework (HW) problems together.
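To see how a local installation compares with the R version named above, a quick check along the following lines may help. This is a sketch rather than a course requirement; the variable names are illustrative, and it only inspects R itself, not RStudio.

```r
# R version named in the course notes
course_version <- "4.2.1"

# Version of the R installation running this code
installed_version <- getRversion()

if (installed_version < course_version) {
  message("Installed R ", as.character(installed_version),
          " is older than ", course_version, "; consider updating.")
} else {
  message("Installed R ", as.character(installed_version),
          " is at least ", course_version, ".")
}
```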
Date | Topics, Learning Goals, Course Slides and Homework (HW) |
---|---|
Monday 7 November<br>Tutorial 5pm-6pm<br>Class 6pm-8pm | Hello world and work practices<br>- Data science tools, why R, options for working with R, and citing R.<br>- Downloading R, RStudio, its anatomy and navigating RStudio environment.<br>- Lay out best R coding practices.<br>- Understand importance of reproducibility and working with R Projects.<br>- Identify components of a reprex.<br>- Identify R syntax, how to get help, and use of built-in functions.<br>- Perform mathematical operations in R.<br>- Learn how to install R packages (CRAN, Bioconductor, GitHub).<br>- Identify different file types and diagnose errors.<br>- 00-introduction_deck.pdf<br>- 01-hello-world_deck.pdf (HW: slides 30, 42+)<br>- 02-work-practices_deck.pdf (HW: slide 30)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/wmjQ1LItj-EojCVZ1WhBHcFo5NevOYIgenvsW_GAjbuygX3gxtzoUkNFpf0GKjg.N5ULsw9Iklu0K0IC<br>- Passcode: See email subject 'Data Sciences Institute, UofT -IntroR: Lecture 2' |
Thursday 10 November<br>Tutorial 5pm-6pm<br>Class 6pm-8pm | Data in R (tibbles, strings, factors, times, missing values)<br>- Understand tidyverse package and applications.<br>- Understand differences in R data types and structures.<br>- Become aware of data subsetting techniques.<br>- Be able to mix data types; distinguish between explicit and implicit coercion.<br>- Perform pattern-matching and string manipulation.<br>- Be able to work with date-time data and categorical data.<br>- Learn how to detect and work with missing values.<br>- 03-data-in-r_deck.pdf (HW: slide 54+)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/ta7qfgLidOjX8N0EHi4L4fCYVZdR4BRDAFMDPwV8h7l6fr70CG2MlY9uS9mg6lxR.iuEIfdAof3rU7UYa |
Saturday 12 November<br>Tutorial 8:30am-9am and noon-12:30pm<br>Class 9am-noon | Manipulation (filtering; arranging; selecting; mutating, pipe; grouping; summarize)<br>- Be able to upload datasets by recognizing file extensions and suitable functions.<br>- Manipulate tabular data with dplyr: A Grammar of Data Manipulation.<br>- Apply manipulation techniques for data cleaning and summarization.<br>- Use manipulation techniques for reshaping data for user needs.<br>- 04-manipulation_deck.pdf (HW: slide 50+)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/H1RWjIguR2hcpdyQP2Tpd4qq6rzJKEgRMXy5Sbj_dPMWBCUfEn9Zbc7_6RTqn5oM.8FbZ5VPj1LVzzDR3 |
Monday 14 November<br>Tutorial 5pm-6pm<br>Class 6pm-8pm | Wrangling (importing data; pivot, joining data; data.table)<br>- Recognize functions for importing different file types.<br>- Be aware of tidy data rules and limitations.<br>- Be able to generate toy datasets and utilize datasets from R packages.<br>- Perform different joins and distinguish between mutating/filtering joins.<br>- Understand garbage collection system in R.<br>- Be able to determine the memory usage of R sessions.<br>- Identify memory efficient methods of working with large datasets.<br>- 05-wrangling_deck.pdf (HW: slides 45+)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/BFK0cJORGLm2mCMzq5COW16bpOd8Fo5IUuHIcIGfviD3JXncb1CvUW9J09yYtqdY.92P4ncox0FPp67d1 |
Thursday 17 November<br>Tutorial 5pm-6pm<br>Class 6pm-8pm | Industry case study: Social Determinants of Health Associated with Patient Portal Use in Pediatric Diabetes<br>- Speaker: Nicholas Mitsakakis, PhD, P.Stat.<br>- 12-case_study.pdf<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/-aRlwbHY3Kouc5OSZxNwS5Lj8YLV1Mufy2vyTqYlQ25ucjJ5IOl6rEZ2YvNNNvuE.Z6ak7_75dioVRULY |
Saturday 19 November<br>Tutorial 8:30am-9am and noon-12:30pm<br>Class 9am-noon | Programming (custom functions, loops, if/else logic, purrr, simulations)<br>- Identify components and requirements of writing functions.<br>- Understand function structure: arguments, return values and default values.<br>- Learn flow control: for/while loops and conditional statements.<br>- Identify use of functional programming tools for iterations.<br>- Learn to simulate data, randomization and sampling.<br>- 06-programming_deck.pdf (HW: slides 33+)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/gt7ZIc2bNKQ88WlRXnXxnTtf4Rf2LkxSNWMhAETsxTtF4cjhIxDJIC-tNKmjM0BR.xk9WxQfl3dGaS4wn |
Monday 21 November<br>Tutorial 5pm-6pm<br>Class 6pm-8pm | Visualization (initialization, choosing chart types, ggplot, customizing)<br>- Become familiar with grammar of graphics.<br>- Learn to initialize a plot, add aesthetics and layers.<br>- Identify how to customize plots with title, labels, axis, theme, size and fills.<br>- Be able to work with colour choices and use of legends.<br>- Become familiar with different visual effects and impact on storytelling.<br>- Consider accessibility principles for visualizations.<br>- 07-visualization_deck.pdf (HW: slides 18, 78+)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/ta2hynzqD8Sup14McivK-08U6ZBDbD59yrmMxlcd_5cn426i_zjDFcra6psEAsLs.lXjskOBHRt9UHNY7 |
Thursday 24 November<br>Tutorial 5pm-6pm<br>Class 6pm-8pm | Data Interaction with Shiny<br>- Learn how to use and make simple interactive web applications from R.<br>- Learn how to use prebuilt layout, input, and output widgets for user interactions.<br>- Explore and adapt from templates of Shiny app developer community.<br>- Explore Shiny apps that are part of Bioconductor and GitHub R packages.<br>- 08-shiny_deck.pdf (HW: slides 10+)<br>- Recording: https://utoronto-my.sharepoint.com/:v:/g/personal/a_silva_utoronto_ca/ET7q21OfwgtDooORYoJUwoUBGT79KnrIq1T_rFPAPBRR8g?e=rEEFlm |
Saturday 26 November<br>Tutorial 8:30am-9am and noon-12:30pm<br>Class 9am-noon | Ethics, inequity and professional skills<br>- Recognize role of ethics in data science.<br>- Touch on concepts of informed consent, privacy, algorithm bias/fairness, and data validity.<br>- Recognize equity, diversity and inclusion practices in data sciences.<br>- Identify professional skills including presentation, project management, and data security/management.<br>- 09-ethics_deck.pdf (HW: slide 13)<br>- 10-inequity_deck.pdf (HW: slide 10)<br>- 11-professional-skills_deck.pdf (HW: slide 22)<br>- Recording: https://utoronto-ca-datasciencesinstitute.zoom.us/rec/share/iL9cG3Gj0R5CWuSB7Ee4KM49mTf8xvc9lpvLkJk5LeOlxQuwAtBvP3HEgmfdSynS.dMfo7rKPFpgeKy6h |
Item | Weight | Purpose and Document Name | Deadline |
---|---|---|---|
00 Pre-course assessment: R/RStudio setup | 0% | Proper setup of R and RStudio, prior to class with the TA. Attendance is optional, but highly recommended to ensure you have proper setup of R and/or RStudio. Document: 00_PrecourseAssignment.pdf | 7 November 2022 before 5.50 pm EST |
01 Class attendance | 10% | Encourage active participation of all attendees in class activities and discussions. Class attendance is mandatory. Ensure you join Zoom using the name provided in the course, as the TA will be marking your attendance. If you are unable to attend class, it is your responsibility to make up the work that was covered. Tutorial attendance is optional, but highly recommended. | Ongoing; 7 November to 26 November, 2022 from 6pm-8pm EST; or 9am-noon |
02 Assessment 1 problem set | 45% | A problem set based on R basics, navigating RStudio, data types and structures, R coercion rules, using built-in functions, working with missing values, use of external functions by downloading R packages, and string manipulation. Document: 02_Assessment1.pdf | 20 November 2022, 9.00 pm EST |
03 Assessment 2 problem set | 45% | A problem set based on data reshaping techniques and the tidyverse R package, including application of data manipulation, wrangling, functional programming and data visualization. There will be questions on best R coding practices and EDI practices in data science. Document: 03_Assessment2.pdf | 29 November 2022, 9.00 pm EST |
- The course will consist mainly of live-coding classes. Learners are expected to follow along with the coding. Be mindful of online fatigue. Be respectful, and allow only one speaker at a time. Use the name provided in the course when participating in Zoom. You may use the chat or your microphone to ask questions. Keep microphones muted unless you need to speak. Use the raise hand feature, and state your name before speaking. Keeping your video on is optional; however, if you choose to leave it on, be mindful of what your peers can see. Course communications will take place via email. Learners with diverse learning styles and needs are welcome in this course.
- See above for assessment weights, deadlines and guidelines. All assessment submissions must be made via email, unless stated otherwise. When submitting assessment files, label them using this format: LASTNAME_FirstInitial_Assessment.format. E.g., SILVA_A_A1.PDF. The instructions for each assessment will specify the ‘Assessment’ name and format. Students must follow this label format. The student is responsible for emailing the correct files on time, in the format specified.
- 10% of the mark will be deducted for each day late, up to 30%. Assignments will NOT be accepted after three days. Be sure to plan well in advance.
- In this course you are expected to follow full disclosure policy: If it’s not your own, new idea, it has a source. All sources must be referenced. For advice on how not to plagiarize, read: https://advice.writing.utoronto.ca/using-sources/, https://www.academicintegrity.utoronto.ca/, and https://guides.library.utoronto.ca/plagiarism. You are responsible for understanding University policies on academic integrity.
- Each student should keep all copies of any assessments submitted.
- While you are encouraged to discuss approaches to assessments with other students, the material turned in must be your own.
- This course, including your participation, may be recorded on video and will be available to students in the course for viewing remotely and after each session. Course videos and materials belong to your instructor, the University, and/or other sources depending on the specific facts of each situation and are protected by copyright. In this course, you are permitted to view session videos and materials for your own academic use, but you should not copy, share, or use them for any other purpose without the explicit permission of the instructor. For questions about the recording and use of videos in which you appear, please contact the instructor.
- Students who are absent from class for any reason (e.g., COVID, other illness or injury, family situation) and who require consideration for missed academic work should report their absence to the instructor and TA, and discuss any needed consideration.
- All course material will be available via the IntroductionToR GitHub repository. As per the prerequisites outlined for the course, it will be assumed that learners are familiar with GitHub. If you are unsure how to download files from GitHub, you may visit the repository link and click 'Code' and then 'Download ZIP' to download all files, as shown below (see red arrows 1 and 2):
- Alternatively, to download individual files, e.g., 03-Data-in-R PDF slides only, visit the slide link and click the 'Download' button on top right side of page as shown below (see red arrow):
- You may also read about cloning a GitHub repository.
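The submission rules in the policies above (the file label format and the late deductions) can be sketched in R as follows. This is an illustration only and not part of the course material: the function names are hypothetical, and the sketch assumes that any part of a day counts as a full day late, which the policy does not state.

```r
# Hypothetical helper: build a file label in the form
# LASTNAME_FirstInitial_Assessment.format
assessment_label <- function(last_name, first_name, assessment, format) {
  paste0(toupper(last_name), "_",
         toupper(substr(first_name, 1, 1)), "_",
         assessment, ".", format)
}

# Hypothetical helper: percentage deducted for a late submission.
# Assumption: any part of a day counts as a full day late.
# Returns NA when the submission is more than three days late,
# because such submissions are not accepted.
late_deduction <- function(days_late) {
  days <- ceiling(max(days_late, 0))
  if (days > 3) {
    return(NA_real_)
  }
  10 * days
}

assessment_label("Silva", "Anjali", "A1", "PDF")  # "SILVA_A_A1.PDF"
late_deduction(0)    # 0
late_deduction(1.5)  # 20
late_deduction(3)    # 30
late_deduction(4)    # NA
```

The assessment instructions remain the authority on the exact name and format to use.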
- Slides covered in the lectures were originally developed by Amy Farrow under the supervision of Rohan Alexander, University of Toronto. Slides have been modified by Anjali Silva for 2022.
- We wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and most recently, the Mississaugas of the Credit River. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.
- Anjali Silva (a.silva@utoronto.ca), University of Toronto.
IntroductionToR welcomes issues, enhancement requests, and other contributions. To submit an issue, use GitHub issues.