How to Enrich Your Data to Stop Fraud

Fraud is a continual issue for merchants of all sizes. E-commerce platforms like Shopify provide many of the key data points needed to identify suspicious orders. But how do we use Shopify's data most effectively to focus on suspicious orders and separate the good from the bad?

In his recent article, 4 Steps to Identify Fraud in Shopify, Alexander Hall shares best practices for manually reviewing Shopify orders, a process that relies on human intuition and domain expertise to find suspicious patterns in these data points. As Hall notes, many automated fraud prevention solutions exist as well. Machine learning, in particular, has gotten a lot of attention for its promise to identify fraud patterns automatically.

To understand the connection between the human and artificial intelligence sides of fraud prevention, we need to talk about "data enrichment". Following the helpful scenarios laid out in Hall's article, we'll make this idea concrete, with an aim to:

  1. Demystify how automated fraud detection systems work;
  2. Show that the best solutions arise from human experts and algorithms working together;
  3. Offer a stepping stone between the unassisted manual review and full-blown automation.

What is data enrichment?

Whether manual or automated, the key to identifying suspicious behavior in data is to look for strong signals (aka "risk parameters"). A signal is an attribute for which good customers and fraudsters tend to have different values.

Let's take price as an example. Because fraudsters are greedy, low-dollar orders tend to have lower rates of fraud. While it would be a mistake to make a fraud determination on price alone, it is an example of a useful signal.

In contrast, consider something like IP Address (which Shopify makes available as "Browser IP"). We expect that this data point is important, but IPs are discrete values; sorting by IP isn't very meaningful, and a fraud rate plot would look like a jumbled mess. To use an IP Address, we must first perform data enrichment. As we'll see in a later example, a common approach using IP is to extract location information based on a vendor-supplied data set.

"Data enrichment" transforms and combines raw data elements to create meaningful fraud signals.

Data enrichment does not necessarily require supplementing orders with outside data. Most data enrichment involves transforming and combining the data elements you already have.

Take, for example, Order Create Date and Customer Create Date. We can't do much with either of them alone, but if we subtract one from the other, we get the Account Age, which fraud operators recognize as a strong fraud signal.

When people review orders, they may do this mental math without giving it much attention. But for automated analysis, computers require us to be precise. To illustrate, here is how one would express the account age signal in Scowl, Sumatra's data enrichment language:

account_age := Days(order_created_at - customer_created_at)

This is the most basic example of the general idea we'll see throughout the rest of this article: remixing data to create something more useful.
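For readers who prefer to see the same arithmetic outside of Scowl, here is a minimal Python sketch (the field names and timestamps are illustrative):

```python
from datetime import datetime, timezone

def account_age_days(order_created_at: datetime, customer_created_at: datetime) -> float:
    """Age of the customer account, in days, at the time the order was placed."""
    return (order_created_at - customer_created_at).total_seconds() / 86400

# An account created two days before the order has an age of 2.0 days
order_ts = datetime(2023, 5, 3, 12, 0, tzinfo=timezone.utc)
customer_ts = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
print(account_age_days(order_ts, customer_ts))  # → 2.0
```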

How enrichment helps

When we take the data enrichment processes happening in the minds of fraud experts and make them precise enough for a computer to understand, we not only take the first step toward automation, we also stand to improve manual review by:

  • Allowing more precise filters to route only the most suspicious orders for review
  • Saving time in the review process by exposing the enriched signals directly
  • Sharing knowledge among a team of reviewers by making signals available to one another

Scenario #1: Location, Location, Location

Hall describes the following retail scenario as a good candidate for investigation:

AVS-Verified billing address in Topeka, Kansas, with an IP Location in Texas and a shipping address in New York

Setting aside "AVS-Verified" for the moment, we have three locations to consider:

  1. Billing Location: Topeka, KS;
  2. IP Location: Texas;
  3. Shipping Location: New York.

It may seem silly, but what is suspicious about this? The answer, of course, is that the locations are far apart from each other. Distant billing and shipping is itself a signal but could be explained by a gift or a customer who relocated. In either case, the device making the purchase is typically at one of the two other locations. So when all three locations are far apart, that is particularly suspicious.

Those who passed grade school geography in the US know that these locations are far apart, but how do we make our software tools understand that? One of the best ways to do this is to transform all locations into longitude-latitude points and compute their distance.

Conveniently, Shopify already enriches billing and shipping addresses with their longitude and latitude. However, the IP lat and long are not included in the raw data. The Shopify docs point to a few sites where reviewers can copy-paste IPs to get the location. Nearly all fraud platform providers will enrich transactions with IP geolocation data to support automation and make manual reviews more efficient.


For the curious, to compute the distance itself, we use something called the Haversine Formula. If you were to stick two push pins in a globe at your endpoints and connect them with yarn, the formula would tell you how long the curved length of yarn would be. In Scowl, the GeoMiles function gives you this distance in miles. We compute the distances between all three pairs of points using such a function.

Finally, we'd like to boil this down to a single signal. We can do this by finding the distance between the two nearest points. The rationale is that if the two nearest points are far away, then all of the points must be far away from each other.
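Here is what those two steps might look like in Python: a standard Haversine implementation standing in for Scowl's GeoMiles, with rough, illustrative coordinates for the three locations in the scenario.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def geo_miles(lat1, lng1, lat2, lng2):
    """Great-circle (Haversine) distance in miles between two lat/lng points."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Approximate coordinates for the three locations in the scenario
bill = (39.05, -95.68)   # Topeka, KS
ip = (31.0, -100.0)      # somewhere in Texas
ship = (40.71, -74.01)   # New York, NY

# Pairwise distances; the signal is the smallest of the three
distances = [
    geo_miles(*bill, *ship),
    geo_miles(*bill, *ip),
    geo_miles(*ship, *ip),
]
nearest_miles = min(distances)
```

If the two nearest points are hundreds of miles apart, as they are here, all three points must be.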


Coming back to "AVS-Verified," the AVS response is a single-letter code that indicates whether the supplied address and/or zip code match what is on file at the payment provider. With some caveats, a stronger match indicates lower risk. In this scenario, however, the AVS match is important because it establishes that the payment truly belongs to someone in Topeka, giving us more confidence in our distance-based signal.

Using the official decoder ring, we can transform the AVS code into the signal we care about for this scenario. Because the zip code establishes a rough location, we check that at least the zip code matches.
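As a sketch, the same check in Python. The codes X, Y, W, and Z follow the common card-network AVS convention (X: address and 9-digit ZIP match; Y: address and 5-digit ZIP; W: 9-digit ZIP only; Z: 5-digit ZIP only); consult your payment provider's documentation for its exact code set.

```python
# AVS result codes in which the zip code matched,
# regardless of whether the street address also matched.
ZIP_MATCH_CODES = {"X", "Y", "W", "Z"}

def avs_zip_match(avs_result: str) -> bool:
    """True when the AVS response indicates at least a zip code match."""
    return avs_result.upper() in ZIP_MATCH_CODES
```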

Putting it all Together

The final point to address is: how far apart is too far apart? While common sense and domain knowledge have gotten us this far, humans are no match for machines when it comes to crunching numbers.

In a machine learning system, we would ingest our enriched signals directly and let the training algorithm decide how best to use them. In a rule-based system, or when simply filtering orders in our database, we aim to choose a threshold that captures a significant amount of fraud while not flagging too many good orders. It is still best to use data to derive threshold choices. However, the statistical techniques for doing so will have to wait for a future post.

In the meantime, we could imagine someone becoming suspicious of an order when all distances are more than, say, 50 miles apart. To see what this all looks like together, here is our full enrichment and rule as a Scowl recipe:

-- Enrich IP with latitude / longitude data
ip_lat, ip_lng := IPLocate(ip)
-- Distance from billing to shipping
bill_ship_miles := GeoMiles(bill_lat, bill_lng, ship_lat, ship_lng)
-- Distance from billing to IP
bill_ip_miles := GeoMiles(bill_lat, bill_lng, ip_lat, ip_lng)
-- Distance from shipping to IP
ship_ip_miles := GeoMiles(ship_lat, ship_lng, ip_lat, ip_lng)
-- The shortest of the three distances
nearest_miles := Minimum(bill_ship_miles, bill_ip_miles, ship_ip_miles)
-- Does the AVS code indicate a zip code match?
avs_zip_match := avs_result in ['X', 'Y', 'W', 'Z']
-- Review if it is an AVS match and all locations are 50+ miles apart
far_apart := Review when avs_zip_match and nearest_miles > 50

If you have ever used a formula in Excel, don't let this "code" intimidate you. Think of each line as a formula that uses existing columns to create one new column in your spreadsheet. Like Excel, a good data-enrichment tool lets you see your formulas in action as you create them to ensure they behave as you expect. When you can touch and feel your data, enrichment feels natural.

Scenario #2: What's in a name?

Later in Hall's article, he describes the following scenario:

an email of "A.Smith" for a new customer whose billing and shipping information is for a Chris Johnston

Again—silly question—what's wrong with that? Well, more often than not, people choose email handles similar to their names. When we see a name-email mismatch, there is an increased risk of a customer's account being compromised or the billing information being stolen.

How do we make this notion of "similarity" precise so we can turn it into a signal? We, as humans, are such natural pattern-matchers that recognizing the dissimilarity between "A.Smith" and "Chris Johnston" feels effortless. Understanding exactly how our minds do it is a bit of a puzzle, but fortunately we don't need to: we can outline a sequence of steps for a computer to follow that reaches the same result, even if the "thought process" is different.

First, uppercase vs. lowercase is irrelevant, so let's make everything lowercase. Next, let's remove characters from the handle that are definitely not part of the name, like numbers, periods, dashes, etc.
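These two normalization steps can be sketched in Python:

```python
import re

def normalize_handle(handle: str) -> str:
    """Lowercase the handle and strip anything that cannot be part of a name."""
    return re.sub(r"[^a-z]", "", handle.lower())

print(normalize_handle("A.Smith"))  # → "asmith"
```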

[Figure: Normalize Handle]

Now we have to consider two possibilities: the handle represents the first name followed by the last name; or vice versa. We'll try both possibilities and give the benefit of the doubt by using the score of the better match.

[Figure: Score Handle]

The final missing piece of our signal is the scoring function itself. How many changes would you make to the name to turn it into the email handle (or vice versa)? The fancy name for this approach is the Levenshtein distance. As a convenience, in Scowl, we make this metric available as a score from 0 (bad match) to 1 (good match). Here are a few examples of StringSimilarity in action:

[Figure: StringSimilarity examples]
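As a sketch of how such a score could be computed, here is a plain dynamic-programming Levenshtein distance in Python, rescaled into a 0-to-1 similarity, with the "try both name orders" step from above. Scowl's built-in StringSimilarity may differ in its details, and the inputs are assumed to be normalized first.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    """Rescale edit distance into a score from 0 (bad match) to 1 (good match)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def handle_name_score(handle: str, first: str, last: str) -> float:
    """Try first+last and last+first; give the benefit of the doubt."""
    return max(string_similarity(handle, first + last),
               string_similarity(handle, last + first))
```

With this sketch, a handle like "asmith" scores a perfect match against the name "A. Smith" but a poor one against "Chris Johnston", which is exactly the separation we want from the signal.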

The complete recipe for this enrichment can be found here.


Data enrichment is a powerful tool for fraud fighters. It transforms the raw ingredients of data into fully baked signals that can separate the good from the bad. By capturing domain experts' best practices as recipes, merchants can filter orders to focus their attention and make faster fraud determinations.

We at Sumatra believe that the domain experts on the front lines of the fraud fight, whether they are sole proprietors or dedicated fraud advisors, are in the best position to craft great signals. Putting the power of data enrichment directly into the hands of non-engineers requires more user-friendly tools, training to master those tools, and a community of shared knowledge. If this mission resonates with you, we'd love to hear from you.

In this article, we have barely scratched the surface of what data enrichment can do. So far, we have focused on transforming data in the order itself. As we'll see in the next article in this series, things get interesting when you enrich the order with historical order data and behavioral data from other touchpoints across your ecosystem.

Until then, keep fighting the good fight.