Skip to content

03 Working with Spreadsheets

Duncan Paterson edited this page Aug 3, 2018 · 7 revisions

We'll have a look at general strategies for how to deal with numbers and text in spreadsheet applications. This is by no means an exhaustive list and the examples are meant to provide some guidance for how to approach specific problems. While the examples below use libreOffice, these concepts apply to all platforms and software vendors. Where individual function-names and menu-items vary, please consult the help and support materials of your platform/software. As always, pick the tools that you are most comfortable with as long as they can get the job done.

Creating a semi-automated workflow

If you only need to fix one or two cells, manual input into the .csv is obviously the fastest way. For everything else think about how to automate your task first. Automated conversions limits the likelihood of accidental errors. It furthermore helps you to work on large numbers of entries, where the difference of transforming five entires or 500 is a question of seconds, instead of hours.

Working with temporary files

The structure and form of the output is defined by the tables in this repo. To get from the old excel files to the new shiny, it is therefore easier to work with patches of entries that can conveniently fit on you screen. Instead of modifying the originals or final files, I would recommend that your created intermediate files or sheets to work in. Once the results are looking good, you can copy the contents of this file into the csv and open a pull request.

You should know how to copying and paste multiple rows or columns. The examples below shows you just the intermediate files. You can store such files in your own branches, but they should not be merged into the master branch and need to be removed before you open a PR. You can also just keep such work files on your local hard-drives outside of the repo.

Here is an example work-file (shortened) for place names. I have copied the first two column contents (including column names) from Place.csv, and left two columns to do my work.

place_ID pace_name test result
PL0001 China
PL0002 Beijing
PL0003 Shanghai
PL0004 Chengdu
PL0005 Xuanhua
PL0006 Sichuan

There are more efficient one step solutions to these examples, you should be able to figure them out quickly on your own. This is just to demonstrate some basic operations but to keep this short I'll reuse the same example in separate steps.

Functions

Lets say that you are working on text entries and have a bunch of potentially new place names.

You can copy all new place names into the test column like this:

place_ID pace_name test result
PL0001 China Bonn
PL0002 Beijing Xuanhua
PL0003 Shanghai Hamburg
PL0004 Chengdu Shanghai
PL0005 Xuanhua Taiyuan
PL0006 Sichuan Shanghai

To create new entries and assign new IDs you want to know which names are already in the database and which aren’t. This is where the vertical lookup function comes in handy.

Select (click) in the cell D2, and enter the following function. You can find a list of available function by clicking on the function wizard or by tipping directly into the function bar next to it:

=VLOOKUP(C2;$B$1:$B$100;1; )

This will look at C2 and compare it to the contents of column B (up to B100). If the name appear in both columns C and B it will write, otherwise it will print #N/A. You can drag the function to the end of the rows you wish to check.

place_ID pace_name test result
PL0001 China Bonn #N/A
PL0002 Beijing Xuanhua Xuanhua
PL0003 Shanghai Hamburg #N/A
PL0004 Chengdu Shanghai Shanghai
PL0005 Xuanhua Taiyuan #N/A
PL0006 Sichuan Shanghai Shanghai

There are hundreds of functions available to help you work on the spreadsheet. Many are self-explanatory like SUM or COUNT, other notable functions include COUNTIF, and TRIM.

Filters

Next up we ll activate filters on the table by clicking auto-filter. If you now click on the little arrow next to the result you can apply a filter. Select#N/A and click OK to only display rows that have #N/A in the result column.

place_ID pace_name test result
PL0001 China Bonn #N/A
PL0003 Shanghai Hamburg #N/A
PL0005 Xuanhua Taiyuan #N/A

You can now copy the contents of test and add only those place names that you were looking for into Place.csv.

place_ID pace_name test result
PL0001 China Bonn #N/A
PL0006 Sichuan Shanghai Shanghai
PL0007 Bonn
PL0008 Hamburg
PL0009 Taiyuan

Once more you could have achieved this with just a single function, but this is just to explain how filters come in handy.

Formatting

Formatting isn’t an automation tool, but once you start using functions and regular expressions in your spreadsheets you will encounter instances where it becomes important.

If a function has created some output that you want to copy as it appears on screen, instead of how it works behind the scenes, you need to highlight the desired content, right-click and select clear direct formatting. Now you can be sure that what you copy and paste is what you see on the screen.

Regular Expression

Regular Expression, the worlds first write only language, allow you to look for patterns and to modify contents based on them. Think wildcards (*) on steroids.

To better explain how this works we need a bit more text then with the place example. Say we want to find all date ranges (1967-1971) inside Act.substantial_discussion.

Open the Find & Replace box, select other options, and tick Regular expressions.

you can find the list of symbols here

You can search for: [:digit:]{4}[:space:]?-[:space:]?[:digit:]{4} to find every instance of four numbers followed by on or zero whitespaces, followed by a minus, followed by …

The possibilities are endless. For more advanced cases you can use regular expressions inside of functions, or use regular expressions for replacements. Even if the sources are not 100% consistent, you can quickly deal with the majority of cases that are. Which will give you time to focus on only those cases that actually require manual attention. Another great use of regular expressions is to check you own data for inconsistencies, do all your dates have 4 digits, are there any commas in your notes or comments?

Outlook

This was a quick run through of the basic modes of interacting with spreadsheets. There are endless tutorials online that can help you to find a solution to your specific task. One glaring omission is the use of Pivot Tables. While these are a valuable tool, they go beyond the scope of this wiki. If you are curious, you can click the last link to get started. Part 2 will show you how filters, function, etc. can come in handy not just for inputting data, but also for analyzing and visualizing it.