- CodeSigning
- AnimalCrossingEconomy
- CarbonEmissionAnalysis
- KnowThyEnemy
- WalkabilityAnalysis
- LASharedBikeAnalysis
- I want money
- ToTheMoon
- SportsSalaries
- jailbreak-gpt
- Predicting_Movie_Revenue
- CovidImpact
- PopulationAnalysis
- DrugLink
- SalaryRentComparison
- CostOfLiving
- CarAccidentAnalysis
- RentPricingAnalysis
- FitnessandSleepAnalysis
- Something's Fishy
- Medical Data Analysis
- TikTok_Vs._Spotify
- YoutubeMetadata
- ChatGPT_Insecure_Code_Analysis
- GDPandGDI_correlation
- GithubWrapped
- Final project presentations
- MP5 due on Dec 5
- work on final projects
- Please correct your MP4/2 by Nov 21: lineFix==true only if the GPT output contains the corrected line; otherwise it is false. In many instances it is set incorrectly. Please inspect and correct your data: producing misleading curated data is a very serious problem.
- MP5 clarifications
- work on final projects
- Part 2 of MP4 due: please correct your MP4 by Nov 21: lineFix==true only if the GPT output contains the corrected line; otherwise it is false. I noticed many instances where that is not the case.
- MP5
- WoC Hackathon Nov 17-19 for extra credit (can accommodate up to six students) registration
- work on final projects
- Finish status updates
- SalaryRentComparison
- TikTok_Vs._Spotify
- CodeCommitAnalysis
- animal-crossing-economy
- PopulationAnalysis
- CodeSigning
- CovidImpact
- GDPandGDI_correlation
- ToTheMoon
- LASharedBikeAnalysis
- Schedule FP presentations for Nov 28, 30, Dec 5
- Brief status updates on final projects
- Questions on MP4
- Part 1 of MP4 due
- Work on final project, MP3 due Oct 24
- MP3 grading scheme: nothing - 0, 95%+ of all entries captured - 15; 50%+ of all entries captured - 10; > 0 and < 50% of all entries captured - 5
- Questions on MP4
- Questions on MP3
- MP4
- GCP + MP2 due; current status of MP2
- MP3
- Questions about MP2
- Questions about GCP
- Questions about MP2
- Questions about GCP
- How to use GCP
- MP2: discovery
- Finish FP Proposal, see instructions
- Finish/Questions on project proposal
- Data discovery
- Data storage
- Finish boasters, form most teams
- Presenting MP1 results by the representatives of each group
- the presentations will go in group order (the representative from the first group, the second group...)
- Group 12 cfishe36
- Group 14 alay10
- Group 15 jkim172
- Group 16 knuchol1
- Class(final) project boasters
- Presenting MP1 results in the assigned groups (see below)
1. vpk542 jhawki41 rcarnes jsteed kpatel68
2. rrosenb4 xzl263 bhaynie mtiwari tpanumat
3. gbb823 ebriggs4 sjeroute ttahmid lliu58 lsmit248
4. mwang43 ipelton smalluri hdehler wcuny emoran11
5. jskeen6 asharm42 san5 andlrutt cgraha37 kzeligow
6. jskupien slaughl2 jblackab wfortner jhulen
7. jbower31 pgajjala jmcelr10 fhill5 jchen125 oselyuti
8. bmarth loneal7 smistry1 afriend3 jking148
9. hcurl kchrist gjur1 kcraddoc echavez2 nshoap
10. afranz1 cheadri6 jbrouss2 shuang24 hnguye48
11. bnd674 lbower10 zperry4 jhalloy jrich19
12. cfishe36 mpatel65 naskew skerzel zmille10 dwoun
13. wouyang2 cshubert evaugha3 ayu5 mstott3
14. alay10 jarmiger azeng2 rlau agreer26 twu21
15. rgarg4 ely1 amistry2 jkim172 hli102 lswann
16. wquesinb jmckni13 jshoffn3 knuchol1 sshiran1
- Continue work on MP1
- Work on MP1, including discussing with your assigned peer
- Make sure you have
  a. Forked fdac23/Miniproject1
  b. Posted the idea for your analysis on your peer's fork
  c. Responded to the idea that was posted by your peer
- Question regarding MP1
- Boasters for class project
- World of Code dataset
- See the simple text analysis of your descriptions
- Introducing the MiniProject1 process and template
- Think about selecting the course project (see course projects for the last seven years at fdac22, fdac21, fdac20, fdac19, fdac18, fdac17, fdac16, fdac for inspiration)
- Boasters for class project (if you have an idea for the class project, please commit to fdac23/FinalProjectPitches)
- Follow instructions to make sure your ssh is set up to connect to your docker container
- Work on fdac23/Practice0: due before class on Sep 7
- It involves
- forking
- ssh and clone to your docker container
- rename the notebook on your container
- completing the notebook in your browser (while connected to your container)
- adding/committing/pushing from your container
- creating pull request from your fork
- If you need a refresher on unix tools: edX on unix for data science
- It involves
- Critical Tools
- Version Control
- Magic of Internet
- 88 registered for the class, but only 80 have submitted a PR: see instructions for the previous class
- Please accept your invitation to the fdac23 organization while logged in to GH with the handle you used to submit the pull request
- If you have not done so yet, please accept github fdac23 invitation
- Introductory lecture
- Critical Tools
- Version Control
- Create your github account
- fork repo students
- create your utid.md file providing your name, your interests, and what you want to get out of the course (at least a full paragraph, see example), as described in fdac23/students/README.md; also upload your public ssh key to your account on github. Once done, please
- submit a pull request to fdac23/students
- Make sure you do it a day before the next class so we can start ready
- Join from a PC, Mac, iPad, iPhone or Android device: Please click this URL to start or join. https://tennessee.zoom.us/j/2766448345 Or, go to https://tennessee.zoom.us/join and enter class session/meeting ID: 276 644 8345
- Join from dial-in phone line: (Note: these are NOT toll-free numbers); Dial: +1 646 558 8656 or +1 408 638 0968 Meeting ID: 276 644 8345; Participant ID: Shown after joining the meeting; International numbers available: https://tennessee.zoom.us/zoomconference?m=leg4C6yjhpfGHE-_Q9EYRNHXCUMBC-2T
- Join the Discord server from this link and follow the instructions
- Course: [COSCS-445/COSCS-545]
- Zoom link above and in MK524
- TTh 11:20-12:35
- Instructors: Audris Mockus, audris@utk.edu (office hours - upon request)
- TAs: Ben Klein bklein3@vols.utk.edu Taylor Villarreal tvillarr@vols.utk.edu Office hours - TBD
- Need help?
Simple rules:
- There are no stupid questions. However, it may be worth going over the following steps:
- Think of what the right answer may be.
- Search online: stack overflow, etc.
- code snippets: On GH gist.github.com or, if anyone contributes, for this class
- answers to questions: Stack Overflow
- Look through issues
- Post the question as an issue.
- Ask instructor: email for 1-on-1 help, or to set up a time to meet
The course will combine the theoretical underpinnings of big data with intense practice. In particular, approaches to ethical concerns, reproducibility of the results, absence of context, missing data, and incorrect data will be both discussed and practiced by writing programs to discover the data in the cloud, to retrieve it by scraping the deep web, and to structure, store, and sample it in a way suitable for subsequent decision making. At the end of the course students will be able to discover, collect, and clean digital traces, to use such traces to construct meaningful measures, and to create tools that help with decision making.
Upon completion, students will be able to discover, gather, and analyze digital traces, will learn how to avoid mistakes common in the analysis of low-quality data, and will have produced a working analytics application.
In particular, in addition to practicing critical thinking, students will acquire the following skills:
- Use Python and other tools to discover, retrieve, and process data.
- Use data management techniques to store data locally and in the cloud.
- Use data analysis methods to explore data and to make predictions.
A great volume of complex data is generated as a result of human activities, including both work and play. To exploit that data for decision making it is necessary to create software that discovers, collects, and integrates the data.
Digital archeology relies on traces that are left over in the course of ordinary activities, for example the logs generated by sensors in mobile phones, the commits in version control systems, or the email sent and the documents edited by a knowledge worker. Understanding such traces is more complicated than understanding data collected using traditional measurement approaches.
Traditional approaches rely on a highly controlled and well-designed measurement system. In meteorology, for example, the temperature is taken in specially designed and carefully selected locations to avoid direct sunlight and to be at a fixed distance from the ground. Such measurement can then be trusted to represent these controlled conditions and the analysis of such data is, consequently, fairly straightforward.
The measurements from geolocation or other sensors in mobile phones are affected by numerous (yet not recorded) factors: was the phone kept in the pocket, was it indoors or outside? The devices are not calibrated or may not work properly, so the corresponding measurements would be inaccurate. Locations (without mobile phones) may not have any measurement, yet may be of the greatest interest. This lack of context and inaccurate or missing data necessitates fundamentally new approaches that rely on patterns of behavior to correct the data, to fill in missing observations, and to elucidate unrecorded context factors. These steps are needed to obtain meaningful results from a subsequent analysis.
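As a toy illustration of filling in missing observations from a pattern in the data, the sketch below replaces each missing reading with the median of the readings taken at the same hour on other days. The readings, the `fill_by_hour` name, and the choice of a per-hour median are all assumptions made for this example; real traces usually need richer behavioral patterns than this.

```python
import statistics

# Invented example: one reading per (day, hour); None marks a missing value.
readings = {
    ("mon", 8): 20.5, ("mon", 9): 21.0,
    ("tue", 8): 19.5, ("tue", 9): None,
    ("wed", 8): None, ("wed", 9): 23.0,
}


def fill_by_hour(data):
    """Replace each missing reading with the median for the same hour.

    ASSUMPTION: readings taken at the same hour on other days are a
    reasonable stand-in for the missing one. This is a hypothetical helper.
    """
    # Collect the observed values for every hour of the day.
    by_hour = {}
    for (_, hour), value in data.items():
        if value is not None:
            by_hour.setdefault(hour, []).append(value)

    # Copy the data, substituting the per-hour median where a value is missing.
    filled = {}
    for (day, hour), value in data.items():
        if value is None and hour in by_hour:
            value = statistics.median(by_hour[hour])
        filled[(day, hour)] = value
    return filled


filled = fill_by_hour(readings)
print(filled[("tue", 9)], filled[("wed", 8)])
```

An hour with no observed readings at all stays missing, which is deliberate: the sketch does not invent a value when there is no pattern to borrow from.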
The course will cover basic principles and effective practices to increase the integrity of the results obtained from voluminous but highly unreliable sources.
- Ethics: legal aspects, privacy, confidentiality, governance
- Reproducibility: version control, ipython notebook
- Fundamentals of big data analysis: extreme distributions, transformations, quantiles, sampling strategies, and logistic regression
- The nature of digital traces: lack of context, missing values, and incorrect data
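To make the "extreme distributions, transformations, quantiles" topic a little more concrete, here is a small sketch that uses only the Python standard library. The commit counts are invented for illustration. The point is that with heavy-tailed data a few large values dominate the mean, while quantiles and a log transform describe the bulk of the data better.

```python
import math
import statistics

# Invented example: commits per developer. A few very active developers
# produce a heavy right tail, which is common in digital traces.
commits = [1, 1, 2, 2, 3, 3, 4, 5, 8, 12, 40, 2500]

mean = statistics.mean(commits)
median = statistics.median(commits)

# Quartile cut points: n=4 yields three values (Q1, median, Q3).
q1, q2, q3 = statistics.quantiles(commits, n=4)

# A log transform compresses the tail so values are easier to compare.
log_commits = [math.log10(c) for c in commits]

print(f"mean={mean:.1f} median={median}")
print(f"quartiles: {q1}, {q2}, {q3}")
print(f"log10 range: {min(log_commits):.2f} to {max(log_commits):.2f}")
```

In this made-up sample the mean lands above the third quartile, so it says little about a typical developer, which is why quantiles are usually reported for such data.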
Students are expected to have basic programming skills; in particular, they should be able to use regular expressions, programming concepts such as variables, functions, and loops, and data structures such as lists and dictionaries (for example, COSC 365).
Being familiar with version control systems (e.g., COSC 340), Python (e.g., COSC 370), and introductory-level probability (e.g., ECE 313) and statistics, such as random variables, distributions, and regression, would be beneficial but is not expected. Everyone is expected, however, to be willing and highly motivated to catch up in the areas where they have gaps in the relevant skills.
All the assignments and projects for this class will use github and Python. Knowledge of Python is not a prerequisite for this course, provided you are comfortable learning on your own as needed. While we have strived to make the programming component of this course straightforward, we will not devote much time to teaching programming, Python syntax, or any of the libraries and APIs. You should feel comfortable with:
- How to look up Python syntax on Google and StackOverflow.
- Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
- How to learn new libraries by reading documentation and reusing examples
- Asking questions on StackOverflow or as a GitHub issue.
These apply to real life, as well.
- Must apply "good programming style" learned in class
- Optimize for readability
- Bonus points for:
- Creativity (as long as requirements are fulfilled)
- Agree on an editor and environment that you're comfortable with
- The person who's less experienced/comfortable should have more keyboard time
- Switch who's "driving" regularly
- Make sure to save the code and send it to others on the team
- Class Participation - 15%: students are expected to read all material covered in a week and come to class prepared to take part in the classroom discussions (online). Asking and responding to other students' questions (issues) counts as a key factor for classroom participation. With the online format and the collaborative nature of the projects, this should not be hard to accomplish.
- Assignments - 40%: Each assignment will involve writing (or modifying a template of) a small Python program.
- Project - 45%: one original project done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course that students find particularly compelling. The group needs to submit a project proposal (2 pages, IEEE format) approximately 1.5 months before the end of term. The proposal should provide a brief motivation for the project, a detailed discussion of the data that will be obtained or used in the project, a timeline of milestones, and the expected outcome.
- Scale
letter | percent |
---|---|
a | 95 |
a- | 93 |
b+ | 90 |
b | 88 |
b- | 85 |
c+ | 83 |
c | 79 |
c- | 75 |
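
As a rough illustration of how the weights and the scale above combine, the sketch below computes a weighted course percentage and looks up a letter. It assumes that each row of the scale is the minimum percentage for that letter and that anything below 75 falls outside the listed letters; neither assumption is stated here, and the function names are made up for this example.

```python
# Component weights from the grading breakdown (fractions of the final grade).
WEIGHTS = {"participation": 0.15, "assignments": 0.40, "project": 0.45}

# Scale rows, highest first. ASSUMPTION: each value is the minimum percent
# needed for that letter; the scale itself does not say this explicitly.
SCALE = [("a", 95), ("a-", 93), ("b+", 90), ("b", 88),
         ("b-", 85), ("c+", 83), ("c", 79), ("c-", 75)]


def course_percent(scores):
    """Weighted average of per-component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percent):
    """Return the first letter whose cutoff is met, or None below the scale."""
    for letter, cutoff in SCALE:
        if percent >= cutoff:
            return letter
    return None  # ASSUMPTION: the scale lists nothing below c-


if __name__ == "__main__":
    # Hypothetical component scores for one student.
    scores = {"participation": 100, "assignments": 90, "project": 85}
    total = course_percent(scores)
    print(round(total, 2), letter_grade(total))
```
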
As a programmer you will rarely write anything entirely from scratch; you will reuse code, frameworks, or ideas. You are encouraged to learn from the work of your peers. However, if you don't try to do it yourself, you will not learn. Deliberate practice (activities designed for the sole purpose of effectively improving specific aspects of an individual's performance) is the only way to reach perfection.
Please respect the terms of use and/or license of any code you find, and if you re-implement or duplicate an algorithm or code from elsewhere, credit the original source with an inline comment.
This class assumes you are confident with this material, but in case you need a brush-up...
- A MongoDB Schema Analyzer. One JavaScript file that you run with the mongo shell command on a database collection and it attempts to come up with a generalized schema of the datastore. It was also written about on the official MongoDB blog.
- Modern Applied Statistics with S (4th Edition) by William N. Venables, Brian D. Ripley. ISBN 0387954570
- R
- Code School
- Quick-R
- Git and GitHub
- GitHub Pages
Similar to proposals, but note additional sections:
- Objective (research question)
- Data that was used: how obtained, how processed, integrated, and validated
- What models or algorithms were used
- Results: A description of the results
- Primary issues encountered during the project
- Future work: ideas generated, improvements that would make sense, etc
- Org chart: rough timeline and responsibilities for each member