-
Notifications
You must be signed in to change notification settings - Fork 7
03 Working with Spreadsheets
We'll have a look at general strategies for how to deal with numbers and text in spreadsheet applications. This is by no means an exhaustive list and the examples are meant to provide some guidance for how to approach specific problems. While the examples below use libreOffice, these concepts apply to all platforms and software vendors. Where individual function-names and menu-items vary, please consult the help and support materials of your platform/software. As always, pick the tools that you are most comfortable with as long as they can get the job done.
If you only need to fix one or two cells, manual input into the .csv
is obviously the fastest way. For everything else think about how to automate your task first. Automated conversions limits the likelihood of accidental errors. It furthermore helps you to work on large numbers of entries, where the difference of transforming five entires or 500 is a question of seconds, instead of hours.
The structure and form of the output is defined by the tables in this repo. To get from the old excel files to the new shiny, it is therefore easier to work with patches of entries that can conveniently fit on you screen. Instead of modifying the originals or final files, I would recommend that your created intermediate files or sheets to work in. Once the results are looking good, you can copy the contents of this file into the csv
and open a pull request.
You should know how to copying and paste multiple rows or columns. The examples below shows you just the intermediate files. You can store such files in your own branches, but they should not be merged into the master branch and need to be removed before you open a PR. You can also just keep such work files on your local hard-drives outside of the repo.
Here is an example work-file (shortened) for place names. I have copied the first two column contents (including column names) from Place.csv
, and left two columns to do my work.
place_ID | pace_name | test | result |
---|---|---|---|
PL0001 | China | ||
PL0002 | Beijing | ||
PL0003 | Shanghai | ||
PL0004 | Chengdu | ||
PL0005 | Xuanhua | ||
PL0006 | Sichuan | ||
… | … |
There are more efficient one step solutions to these examples, you should be able to figure them out quickly on your own. This is just to demonstrate some basic operations but to keep this short I'll reuse the same example in separate steps.
Lets say that you are working on text entries and have a bunch of potentially new place names.
You can copy all new place names into the test
column like this:
place_ID | pace_name | test | result |
---|---|---|---|
PL0001 | China | Bonn | |
PL0002 | Beijing | Xuanhua | |
PL0003 | Shanghai | Hamburg | |
PL0004 | Chengdu | Shanghai | |
PL0005 | Xuanhua | Taiyuan | |
PL0006 | Sichuan | Shanghai |
To create new entries and assign new IDs you want to know which names are already in the database and which aren’t. This is where the vertical lookup function comes in handy.
Select (click) in the cell D2, and enter the following function. You can find a list of available function by clicking on the function wizard or by tipping directly into the function bar next to it:
=VLOOKUP(C2,$B$1:$B$100,1, )
This will look at C2
and compare it to the contents of column B
(up to B100
). If the name appear in both columns C
and B
it will write, otherwise it will print #N/A. You can drag the function to the end of the rows you wish to check.
place_ID | pace_name | test | result |
---|---|---|---|
PL0001 | China | Bonn | #N/A |
PL0002 | Beijing | Xuanhua | Xuanhua |
PL0003 | Shanghai | Hamburg | #N/A |
PL0004 | Chengdu | Shanghai | Shanghai |
PL0005 | Xuanhua | Taiyuan | #N/A |
PL0006 | Sichuan | Shanghai | Shanghai |
There are hundreds of functions available to help you work on the spreadsheet. Many are self-explanatory like SUM
or COUNT
, other notable functions include COUNTIF
, and TRIM
.
Next up we ll activate filters on the table by clicking auto-filter.
If you now click on the little arrow next to the result
you can apply a filter.
Select#N/A
and click OK to only display rows that have #N/A
in the result
column.
place_ID | pace_name | test | result |
---|---|---|---|
PL0001 | China | Bonn | #N/A |
PL0003 | Shanghai | Hamburg | #N/A |
PL0005 | Xuanhua | Taiyuan | #N/A |
You can now copy the contents of test
and add only those place names that you were looking for into Place.csv
.
place_ID | pace_name | test | result |
---|---|---|---|
PL0001 | China | Bonn | #N/A |
… | … | … | … |
PL0006 | Sichuan | Shanghai | Shanghai |
PL0007 | Bonn | ||
PL0008 | Hamburg | ||
PL0009 | Taiyuan |
Once more you could have achieved this with just a single function, but this is just to explain how filters come in handy.
Formatting isn’t an automation tool, but once you start using functions and regular expressions in your spreadsheets you will encounter instances where it becomes important.
If a function has created some output that you want to copy as it appears on screen, instead of how it works behind the scenes, you need to highlight the desired content, right-click and select clear direct formatting
. Now you can be sure that what you copy and paste is what you see on the screen.
Regular Expression, the worlds first write only language, allow you to look for patterns and to modify contents based on them. Think wildcards (*
) on steroids.
To better explain how this works we need a bit more text then with the place example. Say we want to find all date ranges (1967-1971
) inside Act.substantial_discussion
.
Open the Find & Replace
box, select other options
, and tick Regular expressions
.
you can find the list of symbols here
You can search for:
[:digit:]{4}[:space:]?-[:space:]?[:digit:]{4}
to find every instance of four numbers followed by on or zero whitespaces, followed by a minus, followed by …
The possibilities are endless. For more advanced cases you can use regular expressions inside of functions, or use regular expressions for replacements. Even if the sources are not 100% consistent, you can quickly deal with the majority of cases that are. Which will give you time to focus on only those cases that actually require manual attention. Another great use of regular expressions is to check you own data for inconsistencies, do all your dates have 4 digits, are there any commas in your notes or comments?
This was a quick run through of the basic modes of interacting with spreadsheets. There are endless tutorials online that can help you to find a solution to your specific task. One glaring omission is the use of Pivot Tables. While these are a valuable tool, they go beyond the scope of this wiki. If you are curious, you can click the last link to get started