Unlocking Your Data
Why build a bank statement converter?
Better data makes for better decisions. My bank has the data but limits how much it will let me access, but I want it all. Turns out that PDF is not such a great input source.
I often wonder how my finances are going, like really going. I am definitely earning more than 10 years ago, but I’m also spend a lot more. Am I getting anywhere or just treating water against inflation?
It's hard to know unless you sit down and look at the data. Having worked in the US and Australia over the last 10 years my financial data is spread all over the place. My partner and I buy everything on card, so I know all the data is there.
- Locked Data. My banks provides 7 years of PDF bank statements. That I can't import into anything.
- Limited Data. I can also access 2 years of transaction data as a CSV.
- 3rd Parties. Plaid, Salt Edge and Yodlee only give you two years of data. It wasn't crystal clear to me if using them violated my banks terms of service.
Analysing our spending habits and finances should be easy right? We both bank with the same bank in Australia and used the same bank in the US. I figured it would be easy to export my bank data thanks to open banking. This was not my experience.
Isn't this a solved problem?
Surely software to convert bank statements into a usable format is a solved problem? Right? Turns out the answer is meh. I signed up for a bunch of free trials and bought a few licences but got pretty average results. None of them where able to give me flawless results.
Some of the more common limitations were; failing due to ads in the middle of the transactions table, not being able to handle multiple accounts in the one PDF, requiring manual input to select an area to extract from or marking the transaction table column positions.
With 453 statements between my partner and I from the last 7 years across 4 banks, anything that required manual input just wasn't going to work for us.
Sanity Checking
Having done a bunch of data migrations over the years I like to eye ball data before writing too much code. So, I started collecting a dataset of bank statements from friends to test against. Turns out there is a lot of variety in how banks structure their PDF layouts. Things like; table column names, types of data, having ads interspersed with transactions, tax notices appearing with transactions, the list goes on. What's more banks like to change the layout every three years or so.
Creating a generic bank statement extractor is a non-trivial problem but well-defined one.
So I built and shipped a slightly embarrassing version. It's good enough to extract the transaction data from my dataset of samples. Currently the output is the same layout as the bank statement. By the end of April I plan to include the option to export data into a more standardised format. This means filling in missing dates, merging transactions descriptions, filling in missing balances and most importantly being able to export into Xero's input format.
Being able to drag and drop my full 453 statements and have it give me back a single file for each bank account is the goal. This should come with a balance reconciliation summary.
Is this going to be useful to other people? I hope so, I suspect that other people interested in FIRE (Financial Independence Retire Early), Forensic Accounting and Merger and Acquisition Due Diligence will find it interesting. If there demand I might open it up as an API for budgeting apps to build on top of.
Exciting times ahead!