Data Matching Algorithm

From MTHWiki

Revision as of 18:45, 19 October 2010 by Wikiadmin (Talk | contribs)
(diff) ←Older revision | Current revision (diff) | Newer revision→ (diff)
Jump to: navigation, search

Contents

Payee Name Resolution

Most banks generate payee name based on the Point Of Sale (POS) identifier. The exact way this generation happens is different from bank to bank, but most of the time it does include the business name that originated the transaction and some sort of POS identifier.

For example, transactions originated from our favorite coffee shop, Dunkin Donuts on Milk Street in Boston are settled in the Bank of America as originated by "DUNKIN #343418 Q35 ".

So how does My Money match a bank originated payee with a internal payee, in this case Dunkin Donuts ?

Synonyms

Almost every object within My Money can carry a list of synonyms. This list is a simple comma separated list of other potential names or aliases than this object is known under. So we if we define a payee "Dunkin Donuts" we can attach as many synonyms to it as we please, for example "Dunkin Donuts Shop #35, Dunkin Coffee".

Image:Dunkinsynonyms.png

My Money string matching algorithms always look for the name matches first and then for synonym matches. If the downloaded payee name and either internal payee name or its synonyms match, then we found our payee.


String Analysis

If there is no apparent match between the downloaded name and My Moneys internal objects then further string analysis is performed. In earlier versions of My Money the name matching algorithm mostly looked for a match between the words in the strings. As our customers noted this approach was deeply flawed because if they had a payee with a common word bread in it, then Atlanta Bread, Panera Bread as well as Au-Bon Pain Bread would be matched to the same payee.

Obviously some words are more important than the others. This problem is well understood in Computer Science, where one of the branches is called Natural language processing (NLP) and it deals deals with the problems of automated generation and understanding of natural human languages.

Our internal name matching algorithm is highly influenced by Jeremy Hiltons seminal paper Identifying and Merging Related Bibliographic Records, published as his Master of Engineering Thesis at MIT in 1996. In a nutshell, we analyze the strings by splitting them into sequences (Markov chaining) and sending them through multi-step N-Gram analysis to calculate how statistically important these sequences are. Then we rank our confidence level and see whether we can match an internal payee.

Here is the screen shot from the Import/Export Options page that controls this process.

Image:Stringmatcher.png

In this example the user has specified to match the strings only when confidence level is 70% or better, but two strings are compared to each other with confidence level of 58%.

My Money will ignore this potential match and will create a brand new Payee.

If the user drops the confidence level to 50% or better, then the strings are matched positively and an existing payee is reused.

Downloaded/Imported Transaction Matching

My Money is designed from ground up to work with the data generated by the banks. We put in a lot of effort and analyzed thousands of electronic account statements to understand better at how to deal with the challenges of reading bank information and matching and merging it with the manually entered transactions in the register.

In the United States and Canada, most banks would allow their customers to download bank statements either in older QIF format or newer OFX. My Money works with both of those formats as well as their permutations, such as QFX (OFX with Quicken Extensions), Active Money and undocumented multi-account QIF.

Please note that My Money does not support Microsofts older OFC (Open Financial Connectivity) and has no plans of supporting it in the future.


QIF Transactions

QIF is an older format and while there are many inconsistencies between different versions of QIF one of the biggest problems of the format is that QIF lacks a way of identifying a transaction:

My Money tries to match these downloaded transactions with internal transaction using the following algorithm:

1) Check numbers between two transactions match

2) Amounts match

3) The dates are the same (* see below for lose vs strict mode)

4) Payees match, within the confidence level

5) In split transactions number of splits, split amounts and split payees match

6) In double entry transactions such as transfers and split transfers the right sides are pointing to the same account


If all these criteria is met we consider transactions matching each other and merge the data between the two.

When downloading data from banks we allow for some differences in dates, as it takes up to 4 business days for a bank to clear a check. So when trying to match a transaction we look in a range of 4 days trying to see whether other criteria matches (amounts, payees etc)

When importing data, such as importing from Microsoft Money, we don't allow for any differences, dates should match to a business day when we run the matching algorithm.

OFX/QFX Transactions

All OFX/QFX transactions carry a unique identifier called ofxid. This id is normally generated by a bank and could be anything, we have seen banks generating ids in excess of 100 characters and we also have seen quite short ids. If this id is present in downloaded data we merge it into internal transaction and use it later to identify potential duplicates. A match between two ofxids means we have a identical transaction and when during the import My Money finds this match it will merge transaction information with emphasis on downloaded data, i.e. Bank data will be used to fill missing fields. User entered information, such as Payees and Categories are kept intact.

Personal tools