Transact


This project exists for two reasons. The first is that I would find it personally useful—I currently do something like this a little more manually in Microsoft Excel. The second is that it tests my ability in a couple of areas.

Transact is supposed to work like this: I download my bank statements for both my Current Account and Credit Card and put the CSV files in a data folder. All previously downloaded statements are also kept in this data folder. Each statement should follow on from the last.

All the statements are then imported and parsed into a pandas dataframe. Some columns are dropped, and then some tidying is required:

  1. For some reason, payments out are negative on Current Account transactions but positive for Credit Card transactions. This needs to be normalised.
  2. The Credit Card Transactions contain a line at the bottom of every statement that is a balance rather than a transaction, which needs to be dropped.
  3. The descriptions are—without fail—absolutely bloody horrible to read. They are all uppercase, they all start with an apostrophe, and they regularly feature vast quantities of extraneous information.

First of all, I need to gather the data. I use glob to get every file in the data folder that has a .csv extension and concatenate them all into a pandas dataframe.

import os
import pandas as pd
import numpy as np
import glob

"""Load data into dataframe and tidy up"""
path = os.path.dirname(os.path.abspath(__file__))
all_files = glob.glob(os.path.join(path, "data/*.csv"))
df = pd.concat((pd.read_csv(f, index_col=False) for f in all_files))

Now I have a dataframe containing all the data from however many CSV files are in the data folder.
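
One small thing to be aware of: each CSV arrives with its own 0-based index, so the concatenated dataframe ends up with repeated index labels. It has never mattered for what follows, but if it ever does, pd.concat takes an ignore_index flag:

df = pd.concat(
    (pd.read_csv(f, index_col=False) for f in all_files),
    ignore_index=True)  # one continuous index instead of each file's own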

But the current account and credit card represent transactions differently: payments out are negative on the current account but positive on the credit card.

# current account and credit card have reversed signs
df.loc[df[" Account Name"] != "'Current", " Value"] *= -1

Now all payments out are negative values.
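
To make the flip concrete, here is a toy illustration with made-up rows; the credit card's exact account-name string is a guess, the point is just that anything other than 'Current gets its sign reversed:

# made-up rows purely to illustrate the sign flip
toy = pd.DataFrame({
    " Account Name": ["'Current", "'Credit Card"],
    " Value": [-20.00, 15.00],   # £20 out of the current account, £15 spent on the card
})
toy.loc[toy[" Account Name"] != "'Current", " Value"] *= -1
print(toy[" Value"].tolist())    # [-20.0, -15.0] - payments out are now negative in both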

There are some columns I’m not interested in at all, so I drop them. More importantly, the different account types represent dates differently. I need to normalise them, so I turn them into datetime objects:

dff = df.drop([" Account Name", " Account Number", " Balance"], axis=1)
dff["Date"] = pd.to_datetime(dff["Date"])

Each credit card statement contains a line that shows a balance rather than a transaction, which would throw everything off. Let's lose that too:

"""Removes balance lines which are not transactions since,
handily, they are the only line that do not have anything
in the type column"""
dff = dff[dff[" Type"].notna()]

Finally, for some reason that I cannot begin to fathom, all the descriptions begin with an apostrophe. It’s got to go. We can set the descriptions to titlecase while we are at it.

# Remove leading apostrophe and convert to titlecase
dff[" Description"] = dff[" Description"].str[1:].str.title()

The data is where I want it. But the descriptions are absolutely horrible. All caps, random numbers—it’s horrendous. You even get the same retailer with multiple different descriptions. For instance:

  1. Amazon.co.uk*5R7LY5IP5 AMAZON.CO.UK LUX
  2. AMZNMktplace amazon.co.uk GBR

These are both Amazon as far as I’m concerned. And this happens all the damn time. This is a little tougher to fix.

Descriptions

There are a few words that I know I’m going to remove every time. I have listed them in a txt file that I can add to if I ever find any more in future statements. I read the file and make a list of words to remove.

"""Clean up descriptions"""

remove_words = []

with open('remove.txt', 'r') as f:
    for line in f.readlines():
        remove_words.append(line.strip())
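
My remove.txt isn't reproduced here, but it is just one word per line; since the replacement later on is case-sensitive, the entries are title-cased to match the cleaned-up descriptions. Something along these lines (the exact entries are illustrative):

Amznmktplace
Gbr
Lux

One small tweak worth considering: opening the file as os.path.join(path, 'remove.txt'), the same way as the data folder, so the script finds it regardless of the working directory.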

Next, it’s time to strip unwanted characters from our descriptions:

remove = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
dff[' Description'] = dff[' Description'].str.replace(
    remove, '', regex=True).str.replace(r'\d+', '', regex=True).str.split(
    r'.Co|\*|,| -').str[0].str.strip()

The remove variable is all the words from remove.txt, each wrapped in \b word-boundary markers and joined with a | character, which a regular expression treats as "or".
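
With the illustrative word list from above, it ends up as a single pattern string:

remove_words = ["Amznmktplace", "Gbr", "Lux"]   # illustrative list
remove = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
print(remove)   # \bAmznmktplace\b|\bGbr\b|\bLux\b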

The line that sets the Description column looks complicated, so I will break it down a little:

dff[' Description'] = dff[' Description'].str.replace(
    remove, '', regex=True)

I am setting the value of the Description column to the value of Description with any words in the removal list replaced with '' (nothing). The regex=True argument tells pandas to treat remove as a regular expression rather than a literal string.

Next, we can remove any numbers:

.str.replace(r'\d+', '', regex=True)

\d+ is a regular expression again: it matches one or more digits, so this part replaces every run of digits with '' (nothing, again).

.str.split(r'.Co|\*|,| -').str[0].str.strip()

The last part splits the description at the first occurrence of .Co, *, a comma, or a space followed by a hyphen, keeps only the part before it (.str[0]), and strips any leftover whitespace.

Example

Take our two Amazon lines from earlier:

  1. Amazon.co.uk*5R7LY5IP5 AMAZON.CO.UK LUX
  2. AMZNMktplace amazon.co.uk GBR

We have already converted to titlecase, and we are now stripping everything after .Co, so the lines read:

  1. Amazon
  2. Amznmktplace Amazon

Amznmktplace is in my list of words for removal. Thus:

  1. Amazon
  2. Amazon
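
To see the whole chain end to end, here is a small self-contained sketch run over those two raw strings. The removal list is the illustrative one from earlier, and the leading apostrophes are shown as they arrive from the bank:

import pandas as pd

remove_words = ["Amznmktplace", "Gbr", "Lux"]   # illustrative, not my full list
remove = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

s = pd.Series([
    "'Amazon.co.uk*5R7LY5IP5 AMAZON.CO.UK LUX",
    "'AMZNMktplace amazon.co.uk GBR",
])

s = s.str[1:].str.title()                       # drop the apostrophe, titlecase
s = (s.str.replace(remove, '', regex=True)      # remove listed words
      .str.replace(r'\d+', '', regex=True)      # remove digits
      .str.split(r'.Co|\*|,| -').str[0]         # keep everything before .Co, *, comma or " -"
      .str.strip())

print(s.tolist())   # ['Amazon', 'Amazon']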

Well alright. Now we’re cooking with gas.