By: Bart Baesens, Seppe vanden Broucke
This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience”, where we answer a data science or analytics question of 140 characters maximum. Also want to submit your question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.
You asked: How can association rules be used for fraud detection? Can you give an example?
Our answer:
Association rules detect frequently occurring relationships between items. They were originally introduced in a market basket analysis context to detect which items are frequently purchased together. The key input is a transactions database D in which each transaction consists of a transaction identifier and a set of items {i1, i2, …, in} selected from all possible items I. An association rule is then an implication of the form X→Y, whereby X ⊂ I, Y ⊂ I and X ∩ Y = ∅. X is referred to as the rule antecedent whereas Y is referred to as the rule consequent. Examples of association rules could be:
- If a customer has a car loan and car insurance, then the customer has a checking account in 80% of the cases.
- If a customer buys spaghetti, then the customer buys red wine in 70% of the cases.
- If a customer visits Web page A, then the customer will visit Web page B in 90% of the cases.
It is important to note that association rules are stochastic in nature. This means that they should not be interpreted as universal truths; instead, they are characterized by statistical measures that quantify the strength of the association. Furthermore, the rules measure correlational associations and should not be interpreted in a causal way.
In a fraud setting, association rules can be used to detect fraud rings in insurance. The transaction identifier then corresponds to a claim identifier and the items to the various parties involved such as the insured, claim adjuster, police officer and claim service provider (e.g. auto repair shop, medical provider, home repair contractor, etc.). Let’s consider an example of a transactions database as depicted below:
Claim identifier | Parties involved |
1 | insured A, police officer X, claim adjuster 1, auto repair shop 1 |
2 | insured A, claim adjuster 2, police officer X |
3 | insured A, police officer Y, auto repair shop 1 |
4 | insured A, claim adjuster 1, police officer Y |
5 | insured B, claim adjuster 2, auto repair shop 2, police officer Z |
6 | insured A, auto repair shop 1, auto repair shop 2, police officer X |
7 | insured C, police officer X, auto repair shop 1 |
8 | insured A, auto repair shop 1, police officer Z |
9 | insured A, auto repair shop 1, police officer X, claim adjuster 1 |
10 | insured B, claim adjuster 3, auto repair shop 1 |
The goal is now to find frequently occurring relationships or association rules between the various parties involved in the handling of the claim. This will be solved using a two-step procedure. In step 1, the frequent item sets will be identified. The frequency of an item set is measured by means of its support, which is the percentage of total transactions in the database that contain the item set. Hence, the item set X has support s if 100s% of the transactions in D contain X. It can be formally defined as follows:
$$\text{support}(X) = \frac{|\{T \in D \mid X \subseteq T\}|}{|D|}$$
Consider the item set {insured A, police officer X, auto repair shop 1}. This item set occurs in transactions 1, 6 and 9, giving a support of 3/10 or 30%. A frequent item set can now be defined as an item set whose support is higher than a minimum value specified by the data scientist (e.g. 10%). Computationally efficient procedures have been developed to identify the frequent item sets.
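As an illustration, the support calculation can be reproduced on the ten claims from the table above. The following is only a minimal sketch: the names `CLAIMS`, `support` and `frequent_item_sets` are our own, and the brute-force enumeration stands in for the computationally efficient procedures mentioned above, which it does not try to imitate.

```python
from itertools import combinations

# Transactions database from the table above: claim id -> parties involved.
CLAIMS = {
    1: {"insured A", "police officer X", "claim adjuster 1", "auto repair shop 1"},
    2: {"insured A", "claim adjuster 2", "police officer X"},
    3: {"insured A", "police officer Y", "auto repair shop 1"},
    4: {"insured A", "claim adjuster 1", "police officer Y"},
    5: {"insured B", "claim adjuster 2", "auto repair shop 2", "police officer Z"},
    6: {"insured A", "auto repair shop 1", "auto repair shop 2", "police officer X"},
    7: {"insured C", "police officer X", "auto repair shop 1"},
    8: {"insured A", "auto repair shop 1", "police officer Z"},
    9: {"insured A", "auto repair shop 1", "police officer X", "claim adjuster 1"},
    10: {"insured B", "claim adjuster 3", "auto repair shop 1"},
}


def support(item_set, claims=CLAIMS):
    """Fraction of claims whose parties include every item in item_set."""
    item_set = set(item_set)
    hits = sum(1 for parties in claims.values() if item_set <= parties)
    return hits / len(claims)


def frequent_item_sets(claims=CLAIMS, min_support=0.1, max_size=3):
    """Brute-force search: try every combination of up to max_size items.

    Acceptable for ten claims; real data sets need a smarter algorithm.
    """
    items = sorted(set().union(*claims.values()))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, claims)
            if s > min_support:  # "higher than" the minimum, as in the text
                result[frozenset(candidate)] = s
    return result


print(support({"insured A", "police officer X", "auto repair shop 1"}))  # 0.3
```

With the 10% threshold used here, an item set has to appear in at least two of the ten claims to count as frequent, so a party that shows up only once (such as insured C) is filtered out.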
Once the frequent item sets have been found, the association rules can be derived in step 2. Multiple association rules can be defined based upon the same item set. Consider the item set {insured A, police officer X, auto repair shop 1}. Example association rules could be:
- If insured A and police officer X then auto repair shop 1
- If insured A and auto repair shop 1 then police officer X
- If insured A then auto repair shop 1 and police officer X
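To make the point that one item set yields several rules concrete, the sketch below lists every way of splitting an item set into a non-empty antecedent and a non-empty consequent. The helper name `candidate_rules` is hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def candidate_rules(item_set):
    """Yield every (antecedent, consequent) split with both sides non-empty."""
    items = sorted(item_set)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent


rules = list(candidate_rules({"insured A", "police officer X", "auto repair shop 1"}))
for antecedent, consequent in rules:
    print("If", " and ".join(antecedent), "then", " and ".join(consequent))
```

For an item set of three parties this produces six candidate rules, including the three listed above.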
The strength of an association rule can be quantified by means of its confidence, which is defined as the conditional probability of the rule consequent, given the rule antecedent. The rule X → Y has confidence c if 100c% of the transactions in D that contain X also contain Y. It can be formally defined as follows:
$$\text{confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{\text{support}(X \cup Y)}{\text{support}(X)}$$
Consider the association rule “If insured A and police officer X then auto repair shop 1”. The antecedent item set {insured A, police officer X} occurs in transactions 1, 2, 6 and 9. Out of these 4 transactions, 3 also include the consequent item set {auto repair shop 1}, which results in a confidence of 3/4 or 75%. Again, the data scientist has to specify a minimum confidence in order for an association rule to be considered interesting.
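The confidence calculation can be checked in the same way. This is again a hedged sketch rather than a reference implementation: `CLAIMS`, `count` and `confidence` are names we introduce here, and the claims are simply copied from the table above. Working with raw counts instead of support fractions gives the same ratio and avoids floating-point rounding.

```python
# Parties involved in claims 1 to 10, copied from the table above.
CLAIMS = [
    {"insured A", "police officer X", "claim adjuster 1", "auto repair shop 1"},
    {"insured A", "claim adjuster 2", "police officer X"},
    {"insured A", "police officer Y", "auto repair shop 1"},
    {"insured A", "claim adjuster 1", "police officer Y"},
    {"insured B", "claim adjuster 2", "auto repair shop 2", "police officer Z"},
    {"insured A", "auto repair shop 1", "auto repair shop 2", "police officer X"},
    {"insured C", "police officer X", "auto repair shop 1"},
    {"insured A", "auto repair shop 1", "police officer Z"},
    {"insured A", "auto repair shop 1", "police officer X", "claim adjuster 1"},
    {"insured B", "claim adjuster 3", "auto repair shop 1"},
]


def count(item_set, claims=CLAIMS):
    """Number of claims that involve every party in item_set."""
    return sum(1 for parties in claims if item_set <= parties)


def confidence(antecedent, consequent, claims=CLAIMS):
    """Share of claims containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = count(antecedent, claims)
    if n_antecedent == 0:
        raise ValueError("antecedent does not occur in any claim")
    return count(antecedent | consequent, claims) / n_antecedent


print(confidence({"insured A", "police officer X"}, {"auto repair shop 1"}))  # 0.75
print(confidence({"insured A", "auto repair shop 1"}, {"police officer X"}))  # 0.6
```

The second call shows that rules built from the same item set can differ in confidence: insured A and auto repair shop 1 appear together in five claims, of which three involve police officer X.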
Once all association rules have been found, they can be inspected more closely and validated. In our example, the association “If insured A and police officer X then auto repair shop 1” does not necessarily imply a fraud ring, but it’s at least worth the effort to further inspect the relationship between these parties.