Fake It Till You Make It: Generating Realistic Synthetic Customer Datasets

Republished By Plato

Being able to create and use synthetic data in projects has become a must-have skill for data scientists.

I have written in the past about using the Python library Faker for creating your own synthetic datasets. Instead of repeating anything in that article, let’s treat this as the second in a series on generating synthetic data for your own data science projects. This time around, let’s generate some fake customer order data.

If you don’t know anything about Faker, how it is used, or what you can do with it, I suggest that you check out the previous article first.

The Plan

The plan is to synthesize a scaled-down version of a set of tables that would be used in the real-world business case of a customer order system.

Aside from items for purchase, let’s think about what is called for in such a scenario.

Customers – in what is not much of a surprise, if you are going to build a system to track customer orders, you are going to need customers
Credit cards – customers need to pay for things, and in our simplified scenario they can only do so with credit cards
Orders – an order will consist of a customer, a cost, and a credit card for payment

That’s the data we need, so that’s the data we will make. After you go through this, you will probably find ways to make it more robust, more detailed, and more like the real world, which you should be able to go ahead and do on your own.

Imports and Helper Functions

Let’s get started. First, the imports.

    from faker import Faker
    import faker.providers.credit_card
    import pandas as pd
    import random
    from random import randint

Next, let’s write a few helper functions that will be of use a little later on.

    def random_n_digits(n):
        # Random integer with exactly n digits
        range_start = 10**(n-1)
        range_end = (10**n)-1
        return randint(range_start, range_end)

    def unique_rand(rands, n):
        # Keep drawing until the value is not already in rands, then record it
        new_int = random_n_digits(n)
        while new_int in rands:
            new_int = random_n_digits(n)
        rands.append(new_int)
        return new_int, rands

    def generate_cost():
        # Cost string: 1 to 4 dollar digits, then a 2-digit cents part (10 to 99)
        cost = ''
        digits = randint(1, 4)
        cost += str(random_n_digits(digits))
        cost += '.' + str(random_n_digits(2))
        return cost

The first function, random_n_digits, will be used to generate a random integer of length n. With attribution to this StackOverflow answer, see the example below:

    def random_n_digits(n):
        range_start = 10**(n-1)
        range_end = (10**n)-1
        return randint(range_start, range_end)

    print(random_n_digits(3))
    print(random_n_digits(5))
    print(random_n_digits(10))

    745
    98435
    7629340561

This will come in handy for identifiers such as customer and order numbers.

The next function, unique_rand(), ensures that a generated identifier is unique within our system. It takes a list of existing integers and the desired length of a new one, uses random_n_digits() to create an integer of that length, and checks it against the list. If the new integer is not already in the list, it is appended; otherwise another one is drawn. The function returns the new integer together with the updated list.
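Because unique_rand() checks membership in a list, every check scans the whole list, which gets slower as the number of identifiers grows. A set makes that check constant time. What follows is a minimal sketch of that alternative, not the helper used in this article; the name unique_ids and its parameters are assumptions made for illustration.

```python
import random

def unique_ids(count, n_digits):
    """Return `count` distinct random integers, each exactly `n_digits` long."""
    low, high = 10 ** (n_digits - 1), 10 ** n_digits - 1
    if count > high - low + 1:
        raise ValueError("not enough distinct values of that length")
    seen = set()
    while len(seen) < count:
        # Adding a value that is already present leaves the set unchanged,
        # so the loop simply continues until enough distinct values exist.
        seen.add(random.randint(low, high))
    return list(seen)

ids = unique_ids(1000, 5)
print(len(ids), len(set(ids)))  # prints: 1000 1000
```

The standard library's random.sample(range(low, high + 1), count) would give the same guarantee in one call; the explicit loop is shown only because it mirrors the check-then-add logic described above.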

The final function’s utility is given away by its name, generate_cost(). To generate a cost, the function randomly picks an integer between 1 and 4, which becomes the number of digits in the dollar portion of the cost. random_n_digits() is then used to generate an integer of that length. The process is repeated to create a 2-digit integer, which becomes the cents portion to the right of the decimal point. Because random_n_digits(2) never returns a value below 10, the cents always fall between 10 and 99. The two parts are joined and returned as a string.
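If you would like cents values such as .00 or .07 to occur as well, one option is to draw the cents from 0 to 99 and zero-pad them when formatting. This is only a sketch of that variation; generate_cost_cents and its max_dollar_digits parameter are hypothetical names, not part of the original helpers.

```python
from random import randint

def generate_cost_cents(max_dollar_digits=4):
    """Return a cost string such as '962.07' with zero-padded cents."""
    digits = randint(1, max_dollar_digits)
    # Dollar part with exactly `digits` digits, same idea as random_n_digits()
    dollars = randint(10 ** (digits - 1), 10 ** digits - 1)
    # Cents cover the full 00 to 99 range
    cents = randint(0, 99)
    return f"{dollars}.{cents:02d}"

print(generate_cost_cents())
```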

Now let’s move on to faking it.

Don’t worry, even Elaine fakes it.

Create Customers

With that, let’s generate the customers. Our 10,000 customers will include the following attributes:

customer ID (cust_id) – generated using the helper functions outlined above
customer name (name) – generated using Faker; use_weighting=True means an attempt is made to have the frequency of generated values match real-world frequencies (“Lisa” will be more frequently generated than will “Braelynn”); the locales denote from where names are being generated
customer address (address) – generated using Faker
customer phone number (phone_number) – generated using Faker
customer date of birth (dob) – generated using Faker
customer note text field (note) – generated using Faker

The code also stores the generated unique customer IDs as a list (cust_ids) so that newly generated IDs can be compared with the existing ones to ensure uniqueness. After this, the dictionary used to store the customer data is passed into a new Pandas DataFrame, and ultimately saved to a CSV file.

    fake = Faker(['en_US', 'en_UK', 'it_IT', 'de_DE', 'fr_FR'], use_weighting=True)

    customers = {}
    cust_ids = []

    for i in range(0, 10000):
        customers[i]={}
        customers[i]['cust_id'], cust_ids = unique_rand(cust_ids, 8)
        customers[i]['name'] = fake.name()
        # Faker returns multi-line addresses and text; flatten them to one line
        customers[i]['address'] = fake.address().replace('\n', ', ')
        customers[i]['phone_number'] = fake.phone_number()
        customers[i]['dob'] = fake.date()
        customers[i]['note'] = fake.text().replace('\n', ' ')

    customer_df = pd.DataFrame(customers).T
    print(customer_df)

    customer_df.to_csv('customer_data.csv', index=False)

          cust_id   name
    0     52287029  Jay Brown
    1     85688731  Frédérique Martel
    2     95499535  Georges Leclerc
    3     28715621  Christian Carpenter
    4     94472217  Lorraine Watts
    ...   ...       ...
    9995  70168635  Léon Couturier
    9996  10483280  Vincent Nelson
    9997  41868059  Gert Klapp
    9998  28049517  Simonetta Garrone
    9999  26781527  Alessio Camanni

          address                                            phone_number
    0     Flat 9, Hart islands, East Elliotchester, DY6N...  0117 4960802
    1     boulevard Chevalier, 93506 BourgeoisBourg          625.665.4731x5846
    2     43 Poole way, Taylorstad, KW45 0FT                 0780881522
    3     Rotonda Olivetti 99, Sandro salentino, 48332 L...  (00960) 04254
    4     Jolanda-Seifert-Allee 113, 91518 Koblenz           +39 353 5623602
    ...   ...                                                ...
    9995  Strada Molesini 3 Appartamento 32, Ariasso ven...  +44(0)1214960433
    9996  Trubring 86, 28785 Ansbach                         +33 5 61 32 08 79
    9997  Strada Casarin 01 Piano 8, Settimo Giovanni ne...  +39 695 7780253
    9998  Marga-Trubin-Straße 2/4, 13495 Feuchtwangen        (07310) 491854
    9999  avenue Susanne Berthelot, 70292 Poirier-sur-Ra...  0114 4960083

          dob         note
    0     2004-08-20  Interroger dormir but remercier atteindre juge...
    1     2009-07-08  Semblable tout désert dominer lutte. Quart mêm...
    2     2021-04-17  Occaecati occaecati temporibus a asperiores di...
    3     1999-05-03  Rem itaque maxime dolor eum omnis. Eligendi qu...
    4     1997-06-02  Doloremque ut illo sunt. Modi non autem conseq...
    ...   ...         ...
    9995  1981-06-06  Language state white receive soon. Usually tru...
    9996  2020-01-03  Similique quasi eos pariatur consequatur liber...
    9997  2018-10-13  Voluptatum exercitationem omnis rem. Beatae al...
    9998  1983-02-09  Treat vote poor church area discuss carry argu...
    9999  1987-02-06  Go remember center toward real food section. S...

    [10000 rows x 6 columns]

Create Credit Cards

Customers need a method to pay for their orders, so let’s give them all credit cards.

Actually, in an effort to simplify, we will generate credit cards without assigning them to any particular customer. Instead, we will just match customers and cards for orders. You could modify this with a little ingenuity to assign cards to customers and then ensure that orders were paid for with the proper cards. I’ll leave that as an exercise for interested readers.
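For readers who want a starting point for that exercise, here is one possible sketch. It assumes there are at least as many cards as customers; the short ID lists merely stand in for the cust_ids and cc_ids lists built in this article, and cards_by_customer, pick_payment, and linked_orders are hypothetical names.

```python
import random

# Hypothetical stand-ins for the cust_ids and cc_ids lists built in this article.
cust_ids = [10000001, 10000002, 10000003]
cc_ids = [11111, 22222, 33333, 44444, 55555]

# Assumption: there are at least as many cards as customers.
shuffled = random.sample(cc_ids, len(cc_ids))  # shuffled copy, original list untouched

# Every customer gets one card first, so nobody is left without a payment method.
cards_by_customer = {cust: [card] for cust, card in zip(cust_ids, shuffled)}

# Any remaining cards go to randomly chosen customers.
for card in shuffled[len(cust_ids):]:
    cards_by_customer[random.choice(cust_ids)].append(card)

def pick_payment(cust_id):
    """Return a card ID that belongs to the given customer."""
    return random.choice(cards_by_customer[cust_id])

# Build a few orders in which the card always belongs to the ordering customer.
linked_orders = []
for _ in range(10):
    cust = random.choice(cust_ids)
    linked_orders.append({"cust_id": cust, "cc_id": pick_payment(cust)})

print(linked_orders[0])
```

With 10,000 customers and 10,000 cards, as generated here, this scheme would give each customer exactly one card.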

Below you will find that unique credit card IDs (cc_id) are generated with the same helper functions and the same basic method as the unique customer IDs were. These IDs are artificially short, but go ahead and make them as long as you would like. The rest of the data, including the card number itself, is generated using Faker. The data is then fed into a Pandas DataFrame and saved as a CSV file for later use.

    credit_cards = {}
    cc_ids = []

    for i in range(0, 10000):
        credit_cards[i]={}
        credit_cards[i]['cc_id'], cc_ids = unique_rand(cc_ids, 5)
        credit_cards[i]['type'] = fake.credit_card_provider()
        credit_cards[i]['number'] = fake.credit_card_number()
        credit_cards[i]['ccv'] = fake.credit_card_security_code()
        credit_cards[i]['expire'] = fake.credit_card_expire()

    credit_cards_df = pd.DataFrame(credit_cards).T
    print(credit_cards_df)

    credit_cards_df.to_csv('credit_card_data.csv', index=False)

          cc_id           type            number  ccv expire
    0     33257   JCB 16 digit   213177754612892  121  11/24
    1     86707  VISA 16 digit  6573538482942722  042  11/31
    2     96668  VISA 16 digit  4780281393619055  671  01/23
    3     73749  VISA 16 digit  3520725757002891  319  04/28
    4     26342  VISA 13 digit    30141856563149  495  10/29
    ...     ...            ...               ...  ...    ...
    9995  14141  VISA 13 digit     4617204802844  640  04/27
    9996  35599        Maestro      639006455203  384  12/21
    9997  46479  VISA 16 digit      503885514391  587  08/24
    9998  78536  VISA 19 digit     4789890563459  057  07/22
    9999  84649     Mastercard  3590096870674031  874  04/31

    [10000 rows x 5 columns]

Create Orders

Now let’s generate ourselves some money.

Order IDs will be made unique in the same manner as the customer IDs and credit card IDs before them. We will then link a random customer and a random credit card in each order, and generate a random cost using generate_cost(), the third of the helper functions introduced earlier.

In what has become a common pipeline, we then create a Pandas DataFrame of the dictionary, and save the data to file as a CSV.

    orders = {}
    order_ids = []

    for i in range(0, 1000):
        orders[i]={}
        orders[i]['order_id'], order_ids = unique_rand(order_ids, 10)
        orders[i]['cust_id'] = random.choice(cust_ids)
        orders[i]['cc_id'] = random.choice(cc_ids)
        orders[i]['cost'] = generate_cost()

    orders_df = pd.DataFrame(orders).T
    print(orders_df)

    orders_df.to_csv('orders.csv', index=False)

           order_id   cust_id  cc_id     cost
    0    9526379779  21484387  95840  6471.85
    1    6999189530  90073074  75578     5.31
    2    6124881941  84882923  13358   962.21
    3    7476579071  91911770  22301    60.82
    4    4102308607  60614412  28339  8086.96
    ..          ...       ...    ...      ...
    995  2021016579  42107923  24863  4165.62
    996  9279206414  49397693  45436     1.27
    997  3378899620  40173623  96470    32.64
    998  2222207181  73076539  40697  9701.29
    999  1040242247  17749465  66052     9.63

    [1000 rows x 4 columns]

The result is that you should have three CSV files that together emulate a real-world business process.

What do you do with the synthetic data now? Get creative. You could do some study, learn some new techniques or concepts, or undertake a project. A few more specific ideas:

Use Python to create an SQL database out of this data, then practice your SQL skills with it
Perform a data exploration project
Visualize some of the synthetic data in interesting ways
See what kind of data preprocessing you could come up with, such as splitting customer names into first and last, verifying that each customer has a credit card, or ensuring young children aren’t able to make purchases
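As a starting point for the SQL idea, the sketch below builds an in-memory SQLite database with Python's built-in sqlite3 module and runs a join across the tables. The handful of rows, the small order IDs, and the reduced set of columns are illustrative assumptions standing in for the contents of the three CSV files, and the table layout is just one reasonable choice.

```python
import sqlite3

# Illustrative rows standing in for the three CSV files (reduced columns).
customers = [(52287029, "Jay Brown", "2004-08-20"),
             (85688731, "Frédérique Martel", "2009-07-08")]
credit_cards = [(33257, "JCB 16 digit"), (86707, "VISA 16 digit")]
order_rows = [(1, 52287029, 33257, 6471.85),
              (2, 52287029, 86707, 5.31),
              (3, 85688731, 33257, 962.21)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT, dob TEXT);
CREATE TABLE credit_cards (cc_id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    cust_id  INTEGER REFERENCES customers(cust_id),
    cc_id    INTEGER REFERENCES credit_cards(cc_id),
    cost     REAL
);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)
conn.executemany("INSERT INTO credit_cards VALUES (?, ?)", credit_cards)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", order_rows)

# Number of orders and total spend per customer, biggest spender first.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n_orders, ROUND(SUM(o.cost), 2) AS total
    FROM orders AS o
    JOIN customers AS c ON c.cust_id = o.cust_id
    GROUP BY c.cust_id, c.name
    ORDER BY total DESC
""").fetchall()

for name, n_orders, total in rows:
    print(name, n_orders, total)
```

With the real files, the insert steps could instead be handled by reading each CSV with pd.read_csv() and writing it out with DataFrame.to_sql(), passing the same connection.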

And just remember: keep on faking it.

Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master’s degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.

Source: https://www.kdnuggets.com/2022/01/fake-realistic-synthetic-customer-datasets-projects.html

Time Stamp: January 11, 2022