Fake Sales Data¶

To follow along with my examples, you can download a raw CSV of the data here; it consists of fake sales data for 2023 and 2024. If you do not trust random downloads from the internet, you can generate your own with the following Python script (shown below, or download it here). The script uses the faker and pandas modules to create the same type of data that I use. Your data will not match mine 1:1, since we are each running a pseudorandom generator, but the headers will be the same.

Script function¶

The script creates a dataset of simulated sales data by combining different pieces of information. It starts by creating a unique identifier for each sale, the transaction ID. Then it generates random dates and customer names to go along with those IDs, plus a region. These will be important for pivot tables and data analysis functions. Next, the script randomly selects categories and products from predefined lists. For example, the Electronics list contains products like a laptop or smartphone. Note that the product is drawn from a separately sampled category, so a row's Product does not always match its Product_Category (the example data below includes a Books row with a Table product). The quantity sold is a random integer from 1 to 20, and the price per unit is picked from a fixed set of values (20, 50, 100, 200, 500). The script then calculates the total sales amount by multiplying the quantity by the price per unit. Finally, it saves all this data to a CSV file, creating a large dataset that suits our use cases here. We could easily adapt this to medical records, GDP, temperature, finance, headcount, salary, or many other business use cases.
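
If you would rather have every product belong to its row's category (the script on this page samples the two columns independently), one possible adjustment is to pick the category first and then look the product up from it. This is a minimal sketch, not part of the original script; `PRODUCTS` and `pick_category_and_product` are hypothetical names introduced only for illustration.

```python
import random

# Same category-to-product lists that the script on this page uses
PRODUCTS = {
    "Electronics": ["Laptop", "Smartphone", "Tablet", "Headphones"],
    "Furniture": ["Chair", "Table", "Sofa", "Bookshelf"],
    "Clothing": ["Shirt", "Jeans", "Jacket", "Shoes"],
    "Food": ["Snacks", "Beverages", "Fruits", "Vegetables"],
    "Books": ["Fiction", "Non-fiction", "Textbook", "Magazine"],
}


def pick_category_and_product(rng=random):
    """Pick a category, then a product from that category's own list."""
    category = rng.choice(list(PRODUCTS))
    product = rng.choice(PRODUCTS[category])
    return category, product


# Generate a handful of (category, product) pairs that always agree
pairs = [pick_category_and_product() for _ in range(5)]
for category, product in pairs:
    print(category, product)
```

To use this idea in the script, you would build the Product_Category and Product columns from the same pairs, for example with `categories, products = zip(*pairs)`.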

The result of the script will be a CSV called sales_data.csv containing 25,000 rows (unless you change num_rows). I use two of these files; one of them holds "2021+2022" data for use in the pandas/merge and pandas/groupby pages, and it can also be downloaded here.
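
Because the script's dates are relative to the day you run it (`start_date="-2y"`), a file covering fixed years such as 2021 and 2022 needs an explicit range. Here is a rough sketch using only the standard library; `random_date` is a hypothetical helper, and I am assuming a plain uniform pick over the range is good enough.

```python
import random
from datetime import date, timedelta


def random_date(start, end, rng=random):
    """Return a uniformly random date between start and end, inclusive."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


# Fixed two-year window instead of "two years back from today"
start, end = date(2021, 1, 1), date(2022, 12, 31)
dates = [random_date(start, end) for _ in range(1000)]

print(min(dates), max(dates))
```

Faker's `date_between` also accepts `date` objects for `start_date` and `end_date`, so passing the same two dates there should work as well.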

Example Data¶

After running it, you'll have a CSV with this type of data:

Transaction_ID,Date,Customer_Name,Region,Product_Category,Product,Quantity,Price_per_Unit,Total_Sales
09792c2c-fee9-425f-8b7f-13b2e5a0b854,2024-08-27,Jesse Gibbs,West,Books,Table,19,100,1900
839d70a3-8621-428a-a3f6-3a43995e466f,2022-12-15,Anthony Torres,East,Clothing,Shirt,13,20,260
286a6ffb-5c0e-4ce7-a5fe-4ba874d52df8,2024-11-13,Dr. Patrick Meza,South,Furniture,Fiction,1,20,20
d561f11c-d446-4b06-99f0-f32cf964af60,2024-03-28,Lori Tyler,South,Furniture,Vegetables,10,100,1000
dc035123-334c-43f8-b706-c5228dc11a2d,2023-03-10,Madison Whitney,South,Clothing,Headphones,14,500,7000
8d72bd77-66fa-409c-b204-4fbc3e0d2092,2023-09-25,Amber Mcconnell,East,Electronics,Jacket,16,20,320
89a6cce0-79fc-4c74-8c70-823dfb35faea,2024-06-20,Christina Scott,North,Clothing,Shoes,12,500,6000
cf1dc351-3b1a-4343-a6ab-270a69d0439b,2024-12-10,Michael Hernandez,South,Food,Shirt,15,200,3000
1d84f956-5b99-4512-a17e-8989a7b59c34,2024-01-25,Phillip Robinson,West,Furniture,Magazine,3,100,300
1659fa3c-a926-47d3-9f4e-745380c2e6f3,2023-05-10,Gregory Clark,North,Electronics,Fruits,20,200,4000
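
A quick sanity check after generating or downloading a file is to confirm that the headers match and that Total_Sales really is Quantity times Price_per_Unit. The sketch below reads the first three rows shown above from an in-memory string so it runs on its own; the variable names are just for illustration.

```python
import io

import pandas as pd

# Three rows copied from the example data above
sample_csv = io.StringIO(
    "Transaction_ID,Date,Customer_Name,Region,Product_Category,Product,"
    "Quantity,Price_per_Unit,Total_Sales\n"
    "09792c2c-fee9-425f-8b7f-13b2e5a0b854,2024-08-27,Jesse Gibbs,West,Books,Table,19,100,1900\n"
    "839d70a3-8621-428a-a3f6-3a43995e466f,2022-12-15,Anthony Torres,East,Clothing,Shirt,13,20,260\n"
    "286a6ffb-5c0e-4ce7-a5fe-4ba874d52df8,2024-11-13,Dr. Patrick Meza,South,Furniture,Fiction,1,20,20\n"
)

df = pd.read_csv(sample_csv, parse_dates=["Date"])

expected_columns = [
    "Transaction_ID", "Date", "Customer_Name", "Region", "Product_Category",
    "Product", "Quantity", "Price_per_Unit", "Total_Sales",
]

# Check the headers and the derived column
headers_ok = list(df.columns) == expected_columns
totals_ok = bool((df["Total_Sales"] == df["Quantity"] * df["Price_per_Unit"]).all())
print(headers_ok, totals_ok)
```

For a real file, swap the in-memory string for the path, as in `pd.read_csv("sales_data.csv", parse_dates=["Date"])`.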

Script¶

#!/usr/bin/env python3

import pandas as pd
import random
from faker import Faker

# Initialize Faker
fake = Faker()

# Parameters
num_rows = 25000  # Number of rows in the dataset

# Generate sample data
data = {
    "Transaction_ID": [fake.uuid4() for _ in range(num_rows)],
    "Date": [fake.date_between(start_date="-2y", end_date="today") for _ in range(num_rows)],
    "Customer_Name": [fake.name() for _ in range(num_rows)],
    "Region": [random.choice(["North", "South", "East", "West"]) for _ in range(num_rows)],
    "Product_Category": [random.choice(["Electronics", "Furniture", "Clothing", "Food", "Books"]) for _ in range(num_rows)],
    "Product": [
        random.choice(
            {
                "Electronics": ["Laptop", "Smartphone", "Tablet", "Headphones"],
                "Furniture": ["Chair", "Table", "Sofa", "Bookshelf"],
                "Clothing": ["Shirt", "Jeans", "Jacket", "Shoes"],
                "Food": ["Snacks", "Beverages", "Fruits", "Vegetables"],
                "Books": ["Fiction", "Non-fiction", "Textbook", "Magazine"],
            }[category]
        )
        for category in random.choices(["Electronics", "Furniture", "Clothing", "Food", "Books"], k=num_rows)
    ],
    "Quantity": [random.randint(1, 20) for _ in range(num_rows)],
    "Price_per_Unit": [
        random.choice([20, 50, 100, 200, 500]) for _ in range(num_rows)
    ],
    # Total_Sales is derived from the two columns above once the DataFrame exists
}

# Convert data into a DataFrame
df = pd.DataFrame(data)

# Calculate Total_Sales
df["Total_Sales"] = df["Quantity"] * df["Price_per_Unit"]

# Save the dataset to a CSV file
output_file = "sales_data.csv"
df.to_csv(output_file, index=False)

print(f"Dataset with {num_rows} rows saved to {output_file}")
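
If you want repeated runs to produce the same values, the generators can be seeded. The sketch below demonstrates the idea with the standard library's `random` only; `sample_rows` is a hypothetical helper, and the seed value 42 is arbitrary.

```python
import random


def sample_rows(seed, n=5):
    """Build n (region, quantity) pairs from a generator seeded with `seed`."""
    rng = random.Random(seed)
    regions = ["North", "South", "East", "West"]
    return [(rng.choice(regions), rng.randint(1, 20)) for _ in range(n)]


# Same seed, same rows
first = sample_rows(42)
second = sample_rows(42)
print(first == second)
```

In the script itself, this would mean calling `random.seed(42)` and `Faker.seed(42)` before building `data`. Even then, the dates depend on the day you run it, since the range ends at "today".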