Because some FIs don't let you download .csvs for transactions older than 6 months ago because ???, this abomination of a tool exists.
tl;dr - this parses PDFs with pdfplumber into text, then runs a bunch of regex on it to capture transactions with some sanity/validation checking to make sure we got all the transactions correct.
- Pre-commit
2d826cf9e6d96f8498411ea945494cabdd0dd7a8
, there was an issue where statements in Jan would have the incorrect year set.- Remediation: Delete all Dec transactions for all years, re-parse Jan statements.
- Removed unneeded dependencies (The original repo actually only used PDFPlumber to parse, not sure why there was tabular left in there)
- Removed support for savings/chequing accounts (you shouldn't be spending out of a chequing account anyway, should be for bill payments only)
- Remove dependency on PDF filename for dates
- Move regex into a centralized dictionary to more easily support other FIs
- Add regex for credit balance on credit cards
- Added statements for easy debugging
- Fix validator function so it makes sense for credit cards
- Add duplicate transaction manual override, as the original author incorrectly assumed that same description, date and amounts were invalid
- Store dates in YYYY-MM-DD for easily CSV import to Firefly III
- Verified FI list:
- BMO MC
- TD Visa
- Manulife Visa
- RBC Visa/MC
- AMEX
- Use more efficient regex, maybe 1 for all FIs but I'm lazy
- Detect account type from statements instead of relying on directory structure
- Docker support for ETL directly into CSV/Firefly III (although efforts are better spent on automated headless selenium .csv interval fetch)
- You would probably only use this tool once to get your old transactions out of PDFs since some banks stop offering .csvs past the recent 3-6 month window
- Clone this repo
- Use a venv
# macOS/linux
python3 -m venv venv
source venv/bin/activate
# windows
python3 -m venv venv
source venv/Scripts/activate
- Install dependencies
(venv) pip install -r requirements.txt
-
Download all your e-statements (v boring)
-
Put the downloaded pdfs into the
statements/TD
,statements/BMO
,statements/RBC
directories depending on what FI the statements are for.
statements
βββ TD
| βββ XXXXXXXXX-2020May25-2020Jun25.pdf
| ...
βββ RBC
| βββ XXXXXXXXX-2020May25-2020Jun25.pdf
| ...
βββ BMO
βββ XXXXXXXXX-2020May12-2020Jun10.pdf
...
- Statement names don't matter, thanks to another fork we parse the PDF for the date
- Set
TARGET_FI
to your directory name, ex.TARGET_FI = 'BMO'
- Run it!
(venv) python teller.py -d statements teller.db
If you put the statements somewhere else, specify the path to their parent directory with the -d
option.
teller.db
, a sqlite3 database file, will contain all the transaction data. You can just leave the data there, and later add new statements and rerun with the same .db file - the tool will manage uniqueness of transactions in the database (duplicate files are fine). I recommend rerunning later with only new statements to save time.
You can use the sqlite3 CLI to run queries, but I recommend using DB Browser for SQLite.
Now you can have fun running queries and feeling bad about your spending habits. For example:
SELECT sum(amount) FROM transactions WHERE description LIKE '%Dunbar Sushi%'
If you see:
Error for statements/BMO/eStatement_2019-05-07.pdf
Discrepancy found, bad parse :(
This means that there are missing transactions not picked up by the regex because the math of starting balance minus the closing balance is not the net amount.
The transaction regex is 3 parts because a transaction has 3 data points. The item in the regex array is named txn
.
The regex is inbetween the "
's, and this is what you put in the simulators.
# date
r"^(?P<dates>(?:\w{3}(\.|)+ \d{1,2} ){2})"
# description
r"(?P<description>.+)\s"
# amount
r"(?P<amount>-?[\d,]+\.\d{2})(?P<cr>(\-|\s?CR))?"
- Turn on debug mode in
teller/pdf_processor.py
by settingdebug = True
- Run it again just on the error statement by moving the other statements out of the dir or into a sub dir
- Inspect the transactions displayed
As you can see, the dates are different so it's probably the date regex.
This confirms that it is the date regex.
After fixing the regex, we can see we can capture the dates again.
Turn off the debug flag and re-run. As you can see, it works perfect now.
Please note that you don't need to delete the .db file and recreate it. It will append new transcations automatically.
Unfortunately, we can never have 100% correct bank statement regex since the bank changes it's formats slightly between statements. Not sure if its intentional or not, but it's very annoying.
For data browsing sanity checks, I recommend sqlitebrowser: https://sqlitebrowser.org/
- Add a bank name to AccountType in
teller/model.py
, ex. BNS (Scotiabank ticker) - Set the
debug
flag toTrue
inteller/pdf_processor.py
- Create the corresponding dir in
statements/BNS
- Drop BNS CC statements in there
- Create a new dict entry in the
regexes
dict inteller/pdf_processor.py
- Copy paste one of the existing regexes, we will tweak it later
- Run it
- Copy the output and grab the parts we care about (opening balance, closing balance, statement date range and transactions) - be careful not to paste your address/CC number into the simulator
- Put it into a regex simulator like https://regex101.com/ or https://regexr.com/ or use the regex feature on your text editor
- Test every regex, modify until it grabs what we need
- Update the regex inside your new dict entry at
regexes['BNS']
- Turn off
debug
flag inteller/pdf_processor.py
, set toFalse
- Run it again, if it fails repeat step 9-11 until it works
It shouldn't take long, it took me 5 minutes to add AMEX support once I got the workflow down.