From Customer Behavioral Signals to Business Value: A PyTorch ANN Tutorial

A real-world blueprint for data preparation, building an Artificial Neural Network, and measuring business readiness with evaluation metrics.

In The Real World

Imagine we run a retail company with both physical stores and an online platform. Every day, we face the same question: how do we allocate our marketing budget? Do we push customers toward our app, our stores, or let them decide?

Getting this wrong is costly. Sending an in-store promotion to someone who only shops online wastes the spend, and targeting a digital campaign at a store-only customer returns little or nothing.

To solve this, we'll build an Artificial Neural Network (ANN) that reads 28 behavioral and demographic signals about each customer and predicts their shopping preference: Online, Store, or Hybrid. Getting this right unlocks real business value:

- Personalized targeting: reach each customer through the channel they actually use
- Smarter budget allocation: spend digital dollars on Online shoppers, in-store budgets on Store shoppers, and omnichannel efforts on Hybrid shoppers
- Churn prevention: catch Hybrid customers before they drop to a single channel
- Lifetime value modeling: shopping preference is a strong predictor of long-term spend

By the end of this notebook, we'll have a model that hits 97.07% overall accuracy, and we'll understand exactly when it's right, when it fails, and what those failures cost the business.

What Is Multi-Class Classification?

In machine learning, classification means teaching a model to assign a label to each example. There are two main types: binary classification, where each example gets one of two labels, and multi-class classification, where each example gets one of three or more.

Our problem is multi-class: every customer belongs to one of three segments (Online, Store, or Hybrid), each needing a different strategy.

Once we can accurately predict a customer's segment, we can:

- Personalize first-touch marketing before the first purchase
- Assign the right service channel (in-store rep vs. live chat vs. self-service)
- Design loyalty rewards that match how they actually shop
- Stop wasting budget on the wrong channel

How Does the Network Pick a Class?

The network outputs three raw scores, one per class, and we pick the class with the highest score. This operation is called the argmax. The scores can also be converted into probabilities that show how confident the prediction is, something we'll explore in Step 22.

About the Dataset

The dataset is available on Kaggle: ANN Shopping Preferences Prediction with PyTorch

The Consumer Shopping Trends 2026 dataset has 11,789 customers, each described by 25 columns: 22 numeric behavioral and demographic features, the categorical gender and city_tier, and the shopping_preference target. Think of it as the kind of data a retailer would collect across purchase history systems, loyalty apps, and customer surveys.

Note: Phase 1 (data preparation) typically takes 60–80% of total project effort in the real world. A model trained on dirty data fails silently in production, and nobody catches it until revenue takes a hit. Getting the data right is what separates analytics that drives decisions from analytics that collects dust.

Let's get started.

Step 1: Import All Essential Libraries

Before we touch any data, we load all the libraries we'll need. Each import below carries a comment summarizing what it does. Everything here is open-source: no license fees, runs on any infrastructure, and widely used in industry for production ML systems.

    import numpy as np
    # NumPy: the foundation of scientific computing in Python.
    # Provides fast, vectorized array operations and math functions.

    import pandas as pd
    # Pandas: our data manipulation workhorse.
    # Think of it as a turbocharged spreadsheet: we load CSV files,
    # inspect columns, filter rows, and reshape data with it.

    import matplotlib.pyplot as plt
    # Matplotlib: the base library for drawing charts in Python.
    # Seaborn is built on top of it, but we still use plt directly
    # to display, title, and save our figures.

    import seaborn as sns
    # Seaborn: beautiful, statistical visualizations with minimal code.
    # Its heatmaps, box plots, and bar charts look polished out of the box.

    import torch
    # PyTorch: our deep learning framework.
    # It provides Tensors (multi-dimensional arrays that can run on GPU)
    # and automatic differentiation (autograd) needed for backpropagation.

    import torch.nn as nn
    # The neural network submodule inside PyTorch.
    # Contains: Linear (fully connected) layers, activation functions
    # (ReLU, Sigmoid, etc.), and loss functions (CrossEntropyLoss, etc.)

    from torch.utils.data import TensorDataset, DataLoader
    # TensorDataset: pairs up our feature tensor X and label tensor y,
    # so that index i always returns (X[i], y[i]) and the two stay synchronized.
    # DataLoader: wraps a dataset and automatically handles splitting into mini-batches,
    # shuffling the order, and iterating over the data during training.

    from sklearn.model_selection import train_test_split
    # A single function that splits our data into training set and test set.
    # It handles the random shuffling and proportional splitting for us.

    from sklearn.preprocessing import StandardScaler, LabelEncoder
    # StandardScaler: scales each numeric feature to have mean = 0 and std = 1.
    # This matters: neural networks train much better when features are on the same scale.
    # LabelEncoder: converts text labels to integers in alphabetical order
    # ('Hybrid' -> 0, 'Online' -> 1, 'Store' -> 2).
    # Neural networks work with numbers, so all labels must be numeric.

    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    # accuracy_score: fraction of predictions that were correct
    # confusion_matrix: table showing which classes were confused with which
    # classification_report: per-class precision, recall, F1-score, and support

    from sklearn.utils.class_weight import compute_class_weight
    # Calculates how much to up-weight rare classes and down-weight common ones.
    # This is our main tool for dealing with the class imbalance we will discover in the EDA.
    from collections import Counter
    # Counts how many times each value appears in a list.
    # We use it to calculate the naive baseline accuracy.

    import random
    # Python's built-in random number generator.
    # Must be seeded separately from NumPy and PyTorch.

    # ---- Reproducibility ----
    # We seed every source of randomness so that results are repeatable
    # from run to run on the same setup.
    # Without this, PyTorch's random weight initialization would differ on each run,
    # producing slightly different accuracy and AUC numbers every time.
    random.seed(42)                            # Python built-in random
    np.random.seed(42)                         # NumPy random
    torch.manual_seed(42)                      # PyTorch random number generators (CPU and CUDA)
    torch.backends.cudnn.deterministic = True  # Ask cuDNN to use deterministic algorithms
    torch.backends.cudnn.benchmark = False     # Disable auto-tuner that introduces randomness

    # ---- Plot Style ----
    sns.set_style('whitegrid')        # Use a clean white grid background for all seaborn plots.
    plt.rcParams['figure.dpi'] = 100  # Set chart resolution to 100 dots per inch.

    print("All libraries imported successfully!")
    print(f"PyTorch version : {torch.__version__}")

Step 2: Load the Dataset

We load the file into a Pandas DataFrame. Think of it as a spreadsheet in Python where each row is a customer and each column is a feature.

- DATA_PATH points to where the file lives
- pd.read_csv() reads the file and stores it as df
- We immediately print the shape to confirm the data loaded correctly

    # ---- File Path ----
    DATA_PATH = '/kaggle/input/datasets/sohaibdevv/consumer-shopping-behavior-and-preference-study-2026/Consumer_Shopping_Trends_2026 (6).csv'

    df = pd.read_csv(DATA_PATH)
    # pd.read_csv reads the CSV file and stores it as a DataFrame.
    # Each row in the CSV becomes a row in df.
    # Each column header becomes a named column in df.

    # Let us immediately confirm how big our dataset is.
    # .shape returns a tuple: (number of rows, number of columns)
    print("Dataset loaded successfully!")
    print(f"Rows : {df.shape[0]:,}")  # :, adds comma thousands separator for readability
    print(f"Columns : {df.shape[1]}")
    print()
    print("Column names:")
    for col in df.columns:
        print(f" - {col}")

Step 3: First Look at the Data

Before building anything, we need to understand what we're working with. Skipping this step is one of the most common mistakes in ML projects: a model trained on misunderstood data will produce misleading predictions.

We'll inspect the data from four angles:

- Row preview: df.head(5) gives us a quick sanity check on whether the values look reasonable
- Data types: df.dtypes confirms each column is in the right format (int64, float64, or object)
- Summary statistics: df.describe() surfaces the range, mean, and spread of every numeric column, helping us spot outliers or features that need scaling
- Categorical distributions: value_counts() on gender, city_tier, and our target shopping_preference shows us how customers are spread across groups

Together, these four checks give us the factual foundation for every design decision that follows.

    # df.head(5) shows the first 5 rows.
    # This gives us a quick sanity check: do the column values look reasonable?
    print("=== First 5 rows ===")
    df.head(5)

    # df.dtypes shows the data type of each column.
    # 'int64' = integer (whole numbers)
    # 'float64' = decimal numbers
    # 'object' = text / categorical (Python string)
    print("=== Column Data Types ===")
    print(df.dtypes)

    # df.describe() computes summary statistics for every NUMERIC column:
    # count = number of non-null values, mean = average, std = standard deviation,
    # min = smallest value, 25%/50%/75% = percentile thresholds, max = largest value.
    # This helps us spot outliers, unexpected ranges, or columns that need scaling.
    print("=== Summary Statistics for Numeric Columns ===")
    df.describe().round(2)

    # Let us also look at the three categorical columns specifically.
    # .value_counts() counts how many times each unique value appears.
    print("=== gender distribution ===")
    print(df['gender'].value_counts())
    print()
    print("=== city_tier distribution ===")
    print(df['city_tier'].value_counts())
    print()
    print("=== shopping_preference distribution (our TARGET) ===")
    print(df['shopping_preference'].value_counts())

Step 4: Data Cleaning

Real-world data is rarely clean straight out of the box. Before we train anything, we need to make sure our data is trustworthy. Here's why each issue matters:

- Missing values: NaN propagates through every calculation in a neural network, causing training to crash or silently produce garbage predictions
- Duplicate rows: the model ends up training on the same customer multiple times, giving that profile more influence than it deserves

We run two checks:

- Missing value check: scans every column for NaN and prints a warning if any are found
- Duplicate row check: counts exact duplicate rows and removes them automatically if found

A model trained on dirty data fails silently, with no error message. It's our job to catch these issues here, before they reach the model.

    # ---- Check for Missing Values ----
    # .isnull() returns a DataFrame of True/False (True where the value is missing).
    # .sum() counts the True values per column (True = 1, False = 0).
    missing_per_column = df.isnull().sum()

    # Only print columns that actually have missing values (saves screen space).
    missing_columns = missing_per_column[missing_per_column > 0]
    if len(missing_columns) == 0:
        print("No missing values found in any column.")
    else:
        print("Columns with missing values:")
        print(missing_columns)

    # ---- Check for Duplicate Rows ----
    # .duplicated() returns a boolean Series: True for rows that are exact duplicates of an earlier row.
    # .sum() counts how many duplicates exist.
    num_duplicates = df.duplicated().sum()
    if num_duplicates == 0:
        print("No duplicate rows found.")
    else:
        print(f"Found {num_duplicates:,} duplicate rows. We will remove them.")
        df = df.drop_duplicates()
        print(f"Shape after removing duplicates: {df.shape}")
    print()
    print(f"Final dataset shape: {df.shape[0]:,} rows x {df.shape[1]} columns, ready for analysis.")

Step 5: Exploratory Data Analysis (EDA)

Before building the model, we need to understand our customers. EDA helps us answer three key questions:

- Who are our customers? How are they distributed across the three segments?
- What signals distinguish each segment? Do our features actually behave differently across groups?
- Are any features redundant? Are we feeding the model duplicate information?

5.1: Understanding the Target Variable

    # Count how many customers belong to each shopping preference category.
    class_counts = df['shopping_preference'].value_counts()
    class_pct = df['shopping_preference'].value_counts(normalize=True) * 100
    # normalize=True returns proportions (0 to 1), multiplied by 100 to get percentages.

    print("=== Target Class Distribution ===")
    for label, count in class_counts.items():
        pct = class_pct[label]
        bar = '#' * int(pct / 2)  # Simple ASCII bar proportional to percentage
        print(f" {label:8s} : {count:6,} ({pct:5.1f}%) {bar}")
    print()
    print(f" Total: {len(df):,} customers")

    # --- Visualize the class distribution ---
    fig, axes = plt.subplots(1, 2, figsize=(13, 5))

    # --- LEFT PLOT: bar chart with counts and percentages ---
    # value_counts() orders classes by frequency, so the colors map to
    # Store (red), Online (blue), Hybrid (green).
    colors = ['#e74c3c', '#3498db', '#2ecc71']
    bars = axes[0].bar(
        class_counts.index,   # x-axis: class names
        class_counts.values,  # y-axis: counts
        color=colors, edgecolor='black', linewidth=0.7
    )
    # Annotate each bar with the count and percentage
    for bar, count, pct in zip(bars, class_counts.values, class_pct.values):
        axes[0].text(
            bar.get_x() + bar.get_width() / 2,  # x position: center of the bar
            bar.get_height() + 80,              # y position: just above the top of the bar
            f'{count:,}\n({pct:.1f}%)',         # text: count on one line, percentage below
            ha='center', va='bottom', fontsize=10, fontweight='bold'
        )
    axes[0].set_title('Shopping Preference Class Counts', fontsize=12, pad=15)
    axes[0].set_xlabel('Shopping Preference', fontsize=11)
    axes[0].set_ylabel('Number of Customers', fontsize=11)
    axes[0].set_ylim(0, class_counts.max() * 1.25)  # Add headroom above the tallest bar

    # --- RIGHT PLOT: pie chart to show proportions more intuitively ---
    axes[1].pie(
        class_counts.values,
        labels=class_counts.index,
        colors=colors,
        autopct='%1.1f%%',  # Show percentage inside each slice, 1 decimal place
        startangle=140,     # Rotate so the biggest slice starts at 140 degrees
        pctdistance=0.75,   # Place percentage labels at 75% of the radius
        wedgeprops={'edgecolor': 'white', 'linewidth': 1.5}
    )
    axes[1].set_title('Shopping Preference Proportions', fontsize=12, pad=15)

    plt.suptitle('Target Variable Distribution', fontsize=14, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()

The first thing we check is how customers are distributed across the three shopping segments. Nearly 9 out of 10 customers prefer physical stores. This is called class imbalance, and it creates two problems we need to address.

The modeling problem: A neural network minimizes average loss. With 87% of customers being Store shoppers, the easiest strategy for the model is to predict "Store" for everyone, achieving ~87% accuracy while completely ignoring Online and Hybrid customers. That's not a useful model.

The business problem: Hybrid customers, despite being only 3.1% of the dataset, are potentially the highest-value segment. They shop across more touchpoints, have more purchase opportunities, and are harder to lose to a competitor. Missing them is a real revenue cost, just not one that shows up in overall accuracy.

How we fix it: In Step 13, we apply class weights to the loss function, penalizing missed Hybrid predictions more heavily. This forces the model to take rare segments seriously.
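To make the imbalance argument concrete, the sketch below computes the naive "always predict Store" baseline and inverse-frequency class weights for an illustrative sample of 1,000 customers. The counts (870 Store, 99 Online, 31 Hybrid) are assumptions chosen to match the 87% and 3.1% shares quoted above, not the dataset's actual counts. The formula n_samples / (n_classes * class_count) is the "balanced" heuristic offered by scikit-learn's compute_class_weight; whether Step 13 uses exactly this setting is not shown in this part of the tutorial.

```python
from collections import Counter

# Illustrative labels only: shares chosen to mirror the 87% / 3.1% figures in the text.
labels = ["Store"] * 870 + ["Online"] * 99 + ["Hybrid"] * 31

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Naive baseline: always predict the most common class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / n_samples

# "Balanced" inverse-frequency weights: n_samples / (n_classes * class_count).
class_weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

print(f"Baseline accuracy (always '{majority_class}'): {baseline_accuracy:.1%}")
for cls, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"  {cls:7s} weight = {weight:.2f}")
```

With these illustrative counts, an error on a Hybrid customer weighs about 28 times as much in the loss as an error on a Store customer (870 / 31), which is what removes the incentive to predict "Store" for everyone.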
5.2: Behavioral Signals Across Segments

    # We pick 6 features that intuitively should differ between Online, Store, and Hybrid shoppers.
    key_features = [
        'daily_internet_hours',        # Online shoppers likely use the internet more
        'monthly_online_orders',       # Online shoppers order online more often
        'monthly_store_visits',        # Store shoppers visit physical stores more
        'tech_savvy_score',            # Online shoppers tend to be more tech-savvy
        'need_touch_feel_score',       # Store shoppers may prefer to physically inspect products
        'online_payment_trust_score',  # Online shoppers trust digital payments more
    ]

    # Create a 2-row x 3-column grid of box plots
    fig, axes = plt.subplots(2, 3, figsize=(15, 9))
    axes = axes.flatten()
    # flatten() converts the 2D array of axes into a 1D array for easy iteration

    # Define a consistent color palette for the three classes
    palette = {'Store': '#e74c3c', 'Online': '#3498db', 'Hybrid': '#2ecc71'}

    for i, feature in enumerate(key_features):
        # sns.boxplot draws a box plot for each class side by side
        # x = the grouping variable (one box per shopping_preference value)
        # y = the numeric feature we are comparing
        # hue = colors each box by the same grouping variable
        sns.boxplot(
            data=df,
            x='shopping_preference',
            y=feature,
            hue='shopping_preference',
            palette=palette,
            ax=axes[i],  # Draw into the i-th subplot
            legend=False,
            width=0.5,
            linewidth=1.2
        )
        axes[i].set_title(feature.replace('_', ' ').title(), fontsize=11, fontweight='bold')
        axes[i].set_xlabel('')  # Remove the redundant x-label from each subplot
        axes[i].set_ylabel('Value', fontsize=9)

    # Add a shared title for the whole figure
    plt.suptitle('Key Feature Distributions by Shopping Preference\n'
                 '(Box = interquartile range, Line = median, Whiskers = 1.5 x IQR)',
                 fontsize=13, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.show()

We plot six features we expect to differ most across segments, a mix of behavioral counts and attitude scores.
Notice the Hybrid pattern: on each of these features, Hybrid customers sit between the Online and Store extremes, reflecting that they genuinely split their behavior across both channels. This also explains why Hybrid is the hardest segment for the model to draw a sharp boundary around, which we'll see again in the evaluation phase.

5.3: Correlation Analysis

    # Select only the 22 numeric columns (exclude 'gender', 'city_tier', 'shopping_preference')
    numeric_cols = df.select_dtypes(include='number').columns.tolist()
    print(f"Numeric columns ({len(numeric_cols)} total): {numeric_cols}")

    # .corr() computes the pairwise Pearson correlation coefficient between every numeric column.
    # The result is a 22 x 22 symmetric matrix.
    corr_matrix = df[numeric_cols].corr()

    plt.figure(figsize=(16, 13))
    # sns.heatmap draws the correlation matrix as a colored grid.
    # cmap='RdBu_r' : blue = negative correlation, white = 0, red = positive correlation
    # center=0      : ensures white is exactly at zero correlation
    # annot=False   : with 22 features, the text would be too small to read, so we use color only
    # square=True   : forces each cell to be square-shaped
    sns.heatmap(
        corr_matrix,
        cmap='RdBu_r',
        center=0,
        vmin=-1, vmax=1,
        annot=False,
        square=True,
        linewidths=0.4,
        linecolor='white',
        cbar_kws={'label': 'Pearson r', 'shrink': 0.8}
    )
    plt.title('Correlation Matrix: All Numeric Features\n'
              '(Red = positive correlation, Blue = negative, White = near-zero)',
              fontsize=14, pad=15)
    plt.xticks(rotation=45, ha='right', fontsize=12)
    plt.yticks(rotation=0, fontsize=12)
    plt.tight_layout()
    plt.show()

    # Print the top 10 most correlated pairs (excluding self-correlation on the diagonal)
    print("\n=== Top 10 Most Correlated Feature Pairs ===")
    upper_tri = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)  # upper triangle only, k=1 excludes diagonal
    )
    top_corr = upper_tri.stack().abs().sort_values(ascending=False).head(10)
    for (col1, col2), val in top_corr.items():
        print(f" {col1:35s} <-> {col2:35s} r = {val:.3f}")

We check whether any of the 22 numeric features are redundant, that is, carrying essentially the same information as another feature.

What the heatmap shows: The highest pairwise correlation in the entire dataset is r = 0.025. By a common rule of thumb, anything below |r| = 0.10 is negligible. In other words, no feature duplicates the linear information in another, so nothing is redundant.

This means there's nothing to cut: we keep every feature, all 28 of them once the categorical columns are encoded.

An important lesson here: near-zero correlation does not mean the features are useless for prediction. Correlation only captures linear, pairwise relationships. A neural network can discover much richer structure:

- Non-linear thresholds: e.g., tech_savvy_score ≥ 8 might strongly signal Online preference, while scores below 8 show no clear pattern
- Interaction effects: two features that look weak individually can be powerful together (e.g., high monthly_online_orders and high online_payment_trust_score jointly identifying Online shoppers very reliably)

The proof: despite a maximum r = 0.025 across all feature pairs, our model still achieves 97.07% accuracy. That gap is exactly what deep learning is built for.

Note: The near-zero correlations across the board also suggest this is a synthetically generated dataset. Real customer data almost always shows some meaningful correlations (e.g., higher income → higher spend, more tech-savvy → more internet hours).

5.4: Do Demographics Add Signal? Gender and City Tier

We check whether the two categorical features, gender and city_tier, show any meaningful differences across shopping segments. The grouped bar charts here tell us whether demographic variables justify their inclusion in the model. In real projects, this is also a cost question: if a feature adds no signal, it's not worth collecting.
- Gender: helps reveal whether different demographic groups need different campaign approaches
- City Tier: reflects differences in store access and delivery infrastructure; Tier 3 cities have fewer physical stores, which may push customers online by necessity

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # --- Palette for shopping preference ---
    pref_palette = {'Store': '#e74c3c', 'Online': '#3498db', 'Hybrid': '#2ecc71'}

    # --- LEFT: Gender vs Shopping Preference ---
    # We use a grouped bar chart (hue='shopping_preference') to compare proportions within each gender group.
    # We normalize within each gender group so we can fairly compare groups of different sizes.
    gender_pref = (
        df.groupby(['gender', 'shopping_preference'])
        .size()  # count occurrences for each combination
        .reset_index(name='count')
    )
    # Calculate the percentage within each gender group
    gender_total = gender_pref.groupby('gender')['count'].transform('sum')
    gender_pref['percentage'] = gender_pref['count'] / gender_total * 100

    sns.barplot(
        data=gender_pref,
        x='gender',
        y='percentage',
        hue='shopping_preference',
        palette=pref_palette,
        ax=axes[0],
        edgecolor='black',
        linewidth=0.5
    )
    axes[0].set_title('Shopping Preference by Gender', fontsize=12, fontweight='bold')
    axes[0].set_xlabel('Gender', fontsize=11)
    axes[0].set_ylabel('Percentage within Gender Group (%)', fontsize=10)
    axes[0].legend(title='Preference', fontsize=9)

    # --- RIGHT: City Tier vs Shopping Preference ---
    tier_pref = (
        df.groupby(['city_tier', 'shopping_preference'])
        .size()
        .reset_index(name='count')
    )
    tier_total = tier_pref.groupby('city_tier')['count'].transform('sum')
    tier_pref['percentage'] = tier_pref['count'] / tier_total * 100

    sns.barplot(
        data=tier_pref,
        x='city_tier',
        y='percentage',
        hue='shopping_preference',
        palette=pref_palette,
        ax=axes[1],
        order=['Tier 1', 'Tier 2', 'Tier 3'],  # Ensure Tier 1, 2, 3 appear in the correct order
        edgecolor='black',
        linewidth=0.5
    )
    axes[1].set_title('Shopping Preference by City Tier', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('City Tier (1 = largest / most developed)', fontsize=11)
    axes[1].set_ylabel('Percentage within City Tier Group (%)', fontsize=10)
    axes[1].legend(title='Preference', fontsize=9)

    plt.suptitle('Categorical Features vs. Shopping Preference', fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

What the charts show: Looking at both plots, the proportion of Store, Online, and Hybrid shoppers is almost the same across all gender groups (Female, Male, Other) and all city tiers (Tier 1, Tier 2, Tier 3). In other words, knowing a customer's gender or city tier alone does not help us predict their shopping preference.

However, we still keep both features in the model. A feature that looks weak on its own can still contribute when the model combines it with the rest of the features, and the network learns during training how much weight to give each one.

Step 6: Feature Engineering

A neural network only understands numbers. But our dataset has two text columns, gender and city_tier, and a text target column, shopping_preference. We need to convert all of them into numeric format before the model can use them.

This step covers three conversions:

- 6.1: One-hot encode the categorical input features
- 6.2: Separate the input features (X) from the target label (y)
- 6.3: Encode the target labels as integers

6.1: One-Hot Encoding: Respecting What Categories Actually Mean

The simplest approach would be to assign integers to categories: Female=0, Male=1, Other=2. But this tells the model that "Male" is halfway between "Female" and "Other" on some numeric scale, an ordering that has no meaning and can introduce systematic bias into predictions.

One-hot encoding avoids this: we create one binary (0/1) column for each unique category. No ordering is implied. Each customer has exactly one "1" in the set of new columns and "0" everywhere else.
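As a small illustration of the idea, the sketch below one-hot encodes a toy gender column by hand. The four sample values and the helper name one_hot are made up for this example; the tutorial itself relies on pd.get_dummies() for the real conversion.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per unique category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{cat}" for cat in categories]
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return columns, rows

genders = ["Female", "Male", "Other", "Male"]  # toy data, not taken from the dataset
columns, rows = one_hot(genders, prefix="gender")

print(columns)
for value, row in zip(genders, rows):
    print(f"{value:7s} -> {row}")
```

Every row sums to exactly 1, and no column is numerically "between" two others, which is the property the integer encoding lacked.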
For city_tier with values 'Tier 1', 'Tier 2', 'Tier 3', the same logic applies: we get city_tier_Tier 1, city_tier_Tier 2, city_tier_Tier 3.

Production note: The exact set of encoded columns, including their names and order, must be identical between training and inference. If new customer data arrives with a previously unseen category value, the encoding pipeline will break or silently produce wrong output. This is a real operational risk in live systems. pd.get_dummies() handles the conversion automatically for our tutorial context.

    # pd.get_dummies() scans the specified columns and replaces each with multiple binary columns.
    # dtype=int ensures the new columns are 0/1 integers rather than True/False booleans.
    # We pass a COPY of df so the original DataFrame stays unchanged.
    df_encoded = pd.get_dummies(df.copy(), columns=['gender', 'city_tier'], dtype=int)

    print(f"Shape BEFORE encoding: {df.shape}")
    print(f"Shape AFTER encoding: {df_encoded.shape}")
    print()

    # Let us check what new columns were created
    old_cols = set(df.columns)
    new_cols = set(df_encoded.columns)
    added = sorted(new_cols - old_cols)
    print(f"New columns added by one-hot encoding ({len(added)} total):")
    for col in added:
        print(f" + {col}")

    # Preview a few rows to verify the encoding looks correct
    print()
    encoded_preview_cols = added + ['shopping_preference']
    df_encoded[encoded_preview_cols].head(4)

6.2: Separating Features (X) from the Target (y)

In machine learning convention:

- X = the feature matrix: all the input signals the model uses to generate predictions
- y = the target vector: the outcome we are trying to predict

These must be kept completely separate throughout the pipeline because in production, we only ever have X. When a new customer record comes in, we know their demographics and behavior, but we do not know their shopping preference. That is precisely what we are trying to predict.
Target leakage, meaning accidentally including the target variable or any feature derived from it in X, is one of the most dangerous and hardest-to-detect bugs in production ML systems. A leaked model looks perfect during evaluation and fails completely the moment it sees real customers who have not been labeled yet.

    # feature_cols = every column except the target column 'shopping_preference'
    # This gives us 28 columns: 22 original numeric + 6 one-hot encoded
    feature_cols = [col for col in df_encoded.columns if col != 'shopping_preference']

    print(f"Number of input features: {len(feature_cols)}")
    print()
    print("All feature columns:")
    for i, col in enumerate(feature_cols):
        print(f" {i+1:2d}. {col}")

    # Extract the feature matrix as a NumPy array
    # .values converts the DataFrame to a 2D NumPy array of shape (n_samples, n_features)
    X = df_encoded[feature_cols].values  # shape: (11789, 28)

    # Extract the target column as a 1D array of text labels
    y_labels = df_encoded['shopping_preference'].values  # shape: (11789,)

    print()
    print(f"X shape (feature matrix) : {X.shape}")
    print(f"y_labels shape : {y_labels.shape}")
    print(f"y_labels sample : {y_labels[:8]}")

6.3: Encoding Target Labels as Integers: The Final Data Type the Model Needs

CrossEntropyLoss, the loss function we will use, requires target labels to be integers, not strings. LabelEncoder converts each unique string label to a unique integer by sorting labels alphabetically and assigning 0, 1, 2, ...:

- Hybrid -> 0
- Online -> 1
- Store -> 2

Important: This mapping is set alphabetically by scikit-learn, not by business priority. We must remember and preserve this mapping throughout the entire pipeline, including when we interpret model output, generate reports, and score new customers in production. The le.classes_ attribute stores this mapping so we can always look it up.

    le = LabelEncoder()
    # Create a LabelEncoder object; it will remember the mapping.

    # fit_transform does two things in one call:
    # 1. fit: learn all unique labels ('Hybrid', 'Online', 'Store')
    # 2. transform: replace each string with its integer code (0, 1, 2)
    y = le.fit_transform(y_labels)  # shape: (11789,), dtype: int64

    print("Label encoding mapping:")
    for integer_code, string_label in enumerate(le.classes_):
        # le.classes_ is a sorted array of the original string labels
        print(f" {integer_code} --> {string_label}")
    print()
    print(f"y (encoded) dtype: {y.dtype}")
    print(f"y sample (first 10): {y[:10]}")
    print()

    # Verify the counts match our original class counts
    unique, counts = np.unique(y, return_counts=True)
    for code, count in zip(unique, counts):
        print(f" Class {code} ({le.classes_[code]:8s}): {count:,} samples")

Step 7: Train / Test Split

We never train and evaluate a model on the same data. If we did, the model could simply memorize all the training examples and score 100% without actually learning anything useful for new customers.

Instead, we split the data into two separate sets:

- Training set (80%): the customers the model learns from
- Test set (20%): customers held back and used only for evaluation

The test set simulates what happens in production, where the model sees customers it has never encountered before. Whatever accuracy we get on the test set is our honest estimate of real-world performance.

We also use stratify=y to make sure the proportion of Store, Online, and Hybrid customers is the same in both sets. Without this, we could accidentally end up with almost no Hybrid customers in the test set, which would leave us unable to measure performance on the rare segments and could make the overall results look better than they really are.
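To see what stratify=y buys us, here is a hand-rolled sketch of a stratified 80/20 split on made-up labels. It is not scikit-learn's implementation; the function name stratified_split and the label counts are assumptions for illustration only. The idea is to shuffle the indices of each class separately and send the same fraction of every class to the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within the class
        n_test = round(len(indices) * test_fraction)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up labels with roughly the imbalance described in Step 5.
labels = ["Store"] * 870 + ["Online"] * 100 + ["Hybrid"] * 30
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(f"Train size: {len(train_idx)}, test size: {len(test_idx)}")
print(f"Test set class counts: {dict(test_counts)}")
```

Because each class is split on its own, the rare Hybrid class is guaranteed its proportional share of the test set instead of depending on the luck of a single global shuffle.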
X_train, X_test, y_train, y_test = train_test_split( X, # Feature matrix: all 28 columns y, # Target vector: integer-encoded shopping preference test_size=0.2, # Reserve 20% of the data for the test set random_state=42,# Fix the random seed for reproducibility stratify=y # Maintain class proportions in both splits (crucial for imbalanced data) ) print(f"Training set : {X_train.shape[0]:,} samples x {X_train.shape[1]} features") print(f"Test set : {X_test.shape[0]:,} samples x {X_test.shape[1]} features") print() # Verify that class proportions are preserved in both splits print("Class proportions in training set:") for code, count in zip(*np.unique(y_train, return_counts=True)): print(f" {le.classes_[code]:8s}: {count:5,} ({count/len(y_train)*100:.1f}%)") print() print("Class proportions in test set:") for code, count in zip(*np.unique(y_test, return_counts=True)): print(f" {le.classes_[code]:8s}: {count:5,} ({count/len(y_test)*100:.1f}%)") Step 8 — Feature Scaling Look at the range of values across our features: monthly_income ranges from 15,000 to 250,000 tech_savvy_score ranges from 1 to 10 If we feed these raw numbers into the network as-is, the model will pay far more attention to monthly_income simply because its numbers are much larger — not because it's actually more informative. This makes training slow, unstable, and biased toward high-magnitude features. StandardScaler fixes this by rescaling every feature to the same range: Mean = 0 Standard deviation = 1 After scaling, every feature is on equal footing regardless of its original unit. For example, a scaled value of +1.5 means "1.5 standard deviations above average for that feature" — whether the original value was in the thousands or single digits. One important rule — fit on train, transform on test: We compute the mean and standard deviation from the training set only , then apply those same values to scale the test set. We never fit the scaler on the test set. Why? 
Because the test set represents future customers who don’t exist yet at training time. If we use test data to compute scaling statistics, we’re secretly letting future information influence our preprocessing — a form of data leakage that makes test performance look better than it really is. scaler = StandardScaler() # Create the scaler object (no transformation yet) # fit_transform on training data: # - fit: compute mean and std for each of the 28 features, using training rows only # - transform: apply the formula (x - mean) / std to each value X_train_scaled = scaler.fit_transform(X_train) # transform on test data: # - Uses the SAME mean and std computed from the training set (NOT re-fit on test data) # - This is essential: our model will see test data the same way it saw training data X_test_scaled = scaler.transform(X_test) print("Scaling complete.") print() print(f"X_train_scaled : shape {X_train_scaled.shape}, dtype {X_train_scaled.dtype}") print(f"X_test_scaled : shape {X_test_scaled.shape}, dtype {X_test_scaled.dtype}") print() # Verify that training features now have approximately mean=0, std=1 print("After scaling (training set statistics):") print(f" Mean of first 3 features : {X_train_scaled[:, :3].mean(axis=0).round(4)}") print(f" Std of first 3 features : {X_train_scaled[:, :3].std(axis=0).round(4)}") print() print("Before scaling (training set statistics for comparison):") print(f" Mean of first 3 features : {X_train[:, :3].mean(axis=0).round(1)}") print(f" Std of first 3 features : {X_train[:, :3].std(axis=0).round(1)}") Step 9 — Convert to PyTorch Tensors PyTorch cannot work directly with NumPy arrays. It uses its own data structure called a Tensor — which looks like a NumPy array but comes with two additional capabilities: GPU acceleration — Tensors can run on a GPU, which performs matrix calculations much faster than a CPU. This becomes critical when training on large datasets. 
Automatic differentiation — PyTorch automatically tracks every operation on tensors and computes the gradients needed for backpropagation. Without this, we’d have to derive and code every gradient by hand. We convert four arrays in total, and the data types must be exactly right: Using the wrong data type — such as float64 for features or float32 for labels — will cause a type mismatch error at training time. Getting this right now saves debugging time later. # torch.tensor() creates a new PyTorch tensor from a NumPy array. # dtype=torch.float32 ensures 32-bit floats (PyTorch's default for neural network weights). X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32) X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32) # dtype=torch.long is PyTorch's name for int64 (64-bit integers). # CrossEntropyLoss specifically requires long (int64) tensors for class labels. y_train_tensor = torch.tensor(y_train, dtype=torch.long) y_test_tensor = torch.tensor(y_test, dtype=torch.long) print("Tensors created:") print(f" X_train_tensor : shape {list(X_train_tensor.shape)}, dtype {X_train_tensor.dtype}") print(f" X_test_tensor : shape {list(X_test_tensor.shape)}, dtype {X_test_tensor.dtype}") print(f" y_train_tensor : shape {list(y_train_tensor.shape)}, dtype {y_train_tensor.dtype}") print(f" y_test_tensor : shape {list(y_test_tensor.shape)}, dtype {y_test_tensor.dtype}") Step 10 — Mini-Batch DataLoader We now have all 9,431 training customers in one large tensor. Instead of feeding all of them into the network at once, we split them into small mini-batches and process one batch at a time. This is the standard way neural networks are trained. 
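The batch counts quoted throughout this notebook follow from simple arithmetic on the 9,431 training samples, a batch size of 64, and 150 epochs. A quick sketch using only the standard library (the variable names are ours, chosen for illustration):

```python
import math

n_train = 9_431      # training samples after the 80/20 split
batch_size = 64      # samples per mini-batch
num_epochs = 150     # full passes over the training set

# The last batch is allowed to be smaller, so we round up.
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * num_epochs

print(f"Batches per epoch : {batches_per_epoch}")
print(f"Last batch size   : {last_batch_size}")
print(f"Weight updates    : {total_updates:,}")
```

Note that the final batch of each epoch holds only the leftover samples, because a DataLoader keeps the incomplete last batch unless drop_last=True is set.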
Here’s why mini-batches are the right choice: With batch_size=64 and 9,431 training samples: Each epoch runs 148 mini-batches Each mini-batch produces one weight update Over 150 epochs, that’s 22,200 weight updates in total We need two objects to make this work: TensorDataset — Pairs X and y together so they always stay in sync. When the DataLoader shuffles the order, features and labels move together and never get separated. DataLoader — Wraps the dataset, splits it into 64-sample batches, and shuffles the order at the start of each epoch. Shuffling is important — without it, the model might pick up patterns based on the order customers appear in, rather than their actual features. # Step 1: Pair X and y into a dataset object train_dataset = TensorDataset(X_train_tensor, y_train_tensor) # Accessing train_dataset[0] returns (X_train_tensor[0], y_train_tensor[0]) # Step 2: Wrap the dataset in a DataLoader BATCH_SIZE = 64 # Process 64 samples per weight update step train_loader = DataLoader( dataset=train_dataset, # The paired X, y dataset we just created batch_size=BATCH_SIZE, # Each iteration returns a batch of 64 samples shuffle=True # Shuffle the order of samples at the start of every epoch. # This is important: if we always feed Store samples first, # the model would see them first every epoch and might be biased. ) # How many batches will we have per epoch? num_batches = len(train_loader) # len() on a DataLoader returns the number of batches # = ceil(n_samples / batch_size) print(f"Training samples : {len(train_dataset):,}") print(f"Batch size : {BATCH_SIZE}") print(f"Batches per epoch: {num_batches}") print() # Let us peek at the shape of one batch to confirm everything is correct first_X_batch, first_y_batch = next(iter(train_loader)) # iter() makes the DataLoader iterable, next() grabs the very first batch. 
print(f"Shape of one X batch: {list(first_X_batch.shape)} (64 samples x 28 features)") print(f"Shape of one y batch: {list(first_y_batch.shape)} (64 class labels)") Step 11 — Neural Network Foundations: Understanding the Tool Before We Deploy It This is the most conceptually important section of the notebook. Before we write a single line of model code, we need to understand what a neural network is , how it learns , and how we design one that fits our problem. In a business context, this understanding matters for two concrete reasons: First , a neural network is only a “black box” if we allow it to be. Analysts who understand how predictions are generated can ask better questions: “Why did the model classify this customer as Hybrid when their purchase history is almost entirely online?” “Is this error type cheap or expensive for our business?” Understanding the mechanics enables those conversations rather than leaving predictions unexplained. Second , every architectural decision — how many layers, how many neurons, which activation function, how we handle class imbalance — has direct consequences for model performance, training cost, and inference speed in production. These are not arbitrary choices. We will explain exactly why we made each one and what we would change if requirements changed. Take your time here. Every step that follows builds on these concepts. 11.1 — What Is a Neuron? The Atomic Unit of a Prediction A neuron is the smallest computing unit in a neural network. It performs three operations in sequence: Step 1 — Receive weighted inputs: Each of the n input values is multiplied by a corresponding weight. A weight is a number the network learns — it represents how strongly each feature should influence this neuron’s output. A weight of 0.0 means “this feature does not matter to this neuron.” A large positive weight means “this feature strongly activates this neuron.” Step 2 — Sum and add a bias: All weighted inputs are summed, then a bias is added. 
The bias lets the neuron shift its baseline response independently of the inputs — think of it as the minimum activation threshold the neuron requires before “firing.” Step 3 — Apply an activation function: The raw sum is passed through an activation function that determines the neuron’s final output. Business analogy: Think of a senior analyst deciding how likely a customer is to be an Online shopper. The inputs are behavioral signals (internet hours, online orders, trust scores). The weights represent how much each signal influences the verdict. The bias is the analyst’s baseline skepticism. The activation function converts weighted evidence into a usable signal for the next stage of analysis. The crucial difference from a human analyst: the neural network runs thousands of these neurons simultaneously in parallel, each independently learning to detect a different pattern, then combines all their signals to produce the final prediction. 11.2 — What Is a Layer? Hierarchical Pattern Recognition A layer is a group of neurons that all receive the same inputs in parallel, each independently learning to detect a different pattern in those inputs. The outputs of one layer feed directly into the next. Each successive layer transforms data into a progressively more abstract representation — from raw features, to learned behavioral patterns, to final probability scores. Why do we need multiple layers? A single linear layer can only draw flat hyperplanes through our 28-dimensional feature space. It can only classify customers using simple “more of this minus more of that” rules. But the real boundary between Online and Hybrid shoppers is not flat — it is a complex, curved surface shaped by interactions between many features simultaneously. 
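As a recap of the three neuron operations from 11.1 and the layer idea from 11.2, here is a minimal pure-Python sketch. The input values, weights, and biases are made up for illustration, and the helper names neuron and layer are ours; a real nn.Linear layer does the same arithmetic with learned values and matrix operations.

```python
def relu(z):
    """Activation: pass positive values through, zero out negatives."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """Weighted sum of inputs, plus bias, then activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return relu(weighted_sum + bias)

def layer(inputs, weight_rows, biases):
    """A layer: several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Three made-up, already-scaled signals for one customer
customer = [1.2, -0.4, 0.8]

# Two neurons, each with its own made-up weights and bias
weight_rows = [
    [0.5, 0.0, 1.0],     # weight 0.0: the second signal does not matter to this neuron
    [-1.0, 0.3, -0.5],   # responds to the opposite pattern
]
biases = [0.1, 0.2]

outputs = layer(customer, weight_rows, biases)
print(outputs)
```

The first neuron fires because its weighted evidence is positive; the second is silenced by ReLU because its weighted sum is negative. Each neuron sees the same customer but reacts to a different pattern.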
Multiple layers with activation functions build hierarchical decision logic, the same kind of reasoning a good analyst uses:
Layer 1 learns low-level signals: “this customer orders frequently online”
Layer 2 combines signals: “a customer who orders online and trusts digital payments and is highly tech-savvy is almost certainly an Online shopper”
This hierarchical abstraction is the core business value of deep learning over simple rules.

11.3 — Activation Functions: The Source of the Model’s Intelligence

Without activation functions, stacking multiple layers adds no value at all. Here is the math that makes this concrete. For two linear layers with weights W₁, W₂ and biases b₁, b₂:

W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)

The two-layer network collapses into a single linear transformation, which together with Softmax is equivalent to multinomial logistic regression. We could stack 100 layers and gain zero additional modeling power. Every layer would be redundant.

The business consequence: a purely linear model can only segment customers using simple linear boundaries. It cannot model the interaction effects and threshold behaviors that make real customer data complex. For behavioral segmentation, linear is almost always insufficient.

The fix: ReLU (Rectified Linear Unit)
ReLU passes positive inputs through unchanged and zeros out negative inputs. Its gradient is exactly 1 for positive inputs and 0 for negative ones, so active units pass gradients backward without shrinking them, however many layers we stack.

Why not Sigmoid?
Sigmoid squashes values into the range (0, 1), and its derivative never exceeds 0.25. Gradients flowing backward through many layers get multiplied by these small derivatives repeatedly, shrinking to near-zero in early layers. This is the vanishing gradient problem. The model effectively stops learning in its early layers, wasting most of its representational capacity. ReLU largely avoids this problem, although a unit whose input stays negative outputs zero and stops learning.

Note: We do NOT add an activation after the output layer. CrossEntropyLoss applies Softmax internally during training. During prediction, we simply pick the output with the highest raw score, so no activation is needed.
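The collapse argument from 11.3 can be checked numerically. The sketch below uses tiny made-up weight matrices and plain Python lists (the helper names matvec, matmul, and add are ours): two stacked linear layers give exactly the same output as one merged linear layer, and inserting ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Made-up weights: 2 inputs -> 2 hidden units -> 1 output
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, -1.0]
W2, b2 = [[2.0, -1.0]], [0.5]
x = [1.0, 2.0]

# Two stacked linear layers with no activation in between
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent linear layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

# The same two layers with ReLU in between
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print("stacked  :", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```

The first two results match for any input, which is why depth alone adds nothing. The ReLU version differs because the non-linearity prevents the two weight matrices from being merged.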
11.4 — Designing Our Architecture: Making Principled Decisions About Model Capacity

Architecture design is our first major resource trade-off decision. A larger model can learn more complex patterns, but it costs more to train, takes longer to score new customers, is harder to maintain, and needs more training data to avoid memorization. A smaller model is faster and cheaper but might miss important patterns.

Our input and output dimensions are fixed by the problem:
Input: 28 (one node per feature, non-negotiable)
Output: 3 (one node per segment: Hybrid, Online, Store, non-negotiable)

For hidden layers, we use the “funnel” pattern, where the hidden layers narrow from 64 to 32 neurons before the 3-way output (28 → 64 → 32 → 3):
Layer 1 (28 → 64): 28 × 64 weights + 64 biases = 1,856 parameters
Layer 2 (64 → 32): 64 × 32 weights + 32 biases = 2,080 parameters
Output (32 → 3): 32 × 3 weights + 3 biases = 99 parameters
Total trainable parameters: 1,856 + 2,080 + 99 = 4,035

For 9,431 training customers, 4,035 parameters is a healthy ratio: large enough to capture the behavioral patterns in the data, small enough to make memorizing individual training examples less likely.

The resource trade-off is real and scales with model size. A much larger model (say, 28→512→256→128→3, which has roughly 179,000 parameters, about 44× ours) might marginally improve performance on this dataset, but it would need significantly more training data to generalize, take longer to train, and add inference latency in production. For a dataset of ~10,000 records, our 4,035-parameter model is an appropriate size.

Why powers of 2? Modern hardware, especially GPUs, tends to execute matrix multiplications most efficiently on these dimensions. The choice rarely costs anything in accuracy and can speed up training at scale.

11.5 — How the Network Learns: The 22,200-Iteration Feedback Loop

Every prediction improvement in our model comes from one repeating cycle: forward pass → measure error → backpropagate → update weights. Understanding this loop explains why training takes the time it does, when to stop, and what to do when it fails.

Step 1 — Forward Pass: Data flows left to right through the network: input (28 features) → Linear → ReLU → Linear → ReLU → Linear → 3 raw scores. Each layer transforms the data using its current weights. The final output is 3 raw scores.
Step 2 — Compute the Loss: We compare predicted scores to true labels using CrossEntropyLoss, which:
Applies Softmax → converts 3 raw scores into 3 probabilities (summing to 1.0)
Takes the negative log of the probability assigned to the true class
Example: true label = “Online” (class 1). If the model assigns a probability close to 1.0 to Online, the loss is close to 0. In the wrong-prediction case, where most of the probability goes to another class, the probability assigned to Online is small and the loss is large. The loss is a single number. Minimizing it is the entire goal of training.

Step 3 — Backward Pass (Backpropagation): PyTorch automatically computes how much each weight contributed to the loss using the chain rule of calculus. This “blame” is quantified as a gradient. loss.backward() triggers this entire computation automatically.

Step 4 — Weight Update: The Adam optimizer uses the gradients to nudge every weight in the direction that reduces loss, with the step size controlled by the learning rate.

Step 5 — Repeat: This cycle runs for every mini-batch. One full pass through all 148 batches = one epoch. We train for 150 epochs = 22,200 weight updates total.

The accumulation trap: PyTorch accumulates gradients by default. We must call optimizer.zero_grad() at the start of each batch to clear the previous batch's gradients, or they will compound and produce incorrect updates.

Step 12 — Build the Neural Network

We now translate the architecture we designed in Step 11 into actual PyTorch code. We use nn.Sequential, a container that connects layers in the order we specify, passing the output of each layer directly into the next. It's the cleanest way to define a straightforward feedforward network like ours.

The model we define here has 4,035 trainable parameters (weights and biases) across three linear layers. Each of these 4,035 parameters will be updated 22,200 times over the course of training: once per mini-batch, across 150 epochs.
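Before the model code, it may help to make the loss computation from Step 11.5 concrete. The sketch below reproduces the Softmax plus negative-log recipe in plain Python for a single customer whose true label is Online (class 1). The raw scores are made up for illustration, and this is the unweighted, single-sample version; nn.CrossEntropyLoss additionally applies class weights and averages over the batch.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(softmax(scores)[true_class])

ONLINE = 1  # class index for "Online" (order: Hybrid, Online, Store)

good_scores = [0.5, 3.0, 1.0]   # made-up: highest score on Online (correct)
bad_scores = [0.5, 1.0, 3.0]    # made-up: highest score on Store (wrong)

good_loss = cross_entropy(good_scores, ONLINE)
bad_loss = cross_entropy(bad_scores, ONLINE)

print(f"P(Online) when right: {softmax(good_scores)[ONLINE]:.3f}  loss: {good_loss:.3f}")
print(f"P(Online) when wrong: {softmax(bad_scores)[ONLINE]:.3f}  loss: {bad_loss:.3f}")
```

The confident correct prediction produces a small loss, and the confident wrong prediction produces a much larger one. That gap is the signal backpropagation uses to adjust the weights.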
# ---- Architecture Hyperparameters ---- NUM_FEATURES = X_train_tensor.shape[1] # 28 — one input neuron per feature HIDDEN_1 = 64 # First hidden layer: 64 neurons HIDDEN_2 = 32 # Second hidden layer: 32 neurons NUM_CLASSES = len(torch.unique(y_train_tensor)) # 3 — Hybrid (0), Online (1), Store (2) print(f"Input features : {NUM_FEATURES}") print(f"Hidden layer 1 : {HIDDEN_1}") print(f"Hidden layer 2 : {HIDDEN_2}") print(f"Output classes : {NUM_CLASSES}") print() # ---- Calculate Total Parameters (for educational purposes) ---- p1 = NUM_FEATURES * HIDDEN_1 + HIDDEN_1 # weights + biases for Layer 1 p2 = HIDDEN_1 * HIDDEN_2 + HIDDEN_2 # weights + biases for Layer 2 p3 = HIDDEN_2 * NUM_CLASSES + NUM_CLASSES # weights + biases for Output layer total_params = p1 + p2 + p3 print("Parameter breakdown:") print(f" Layer 1 (Linear {NUM_FEATURES}→{HIDDEN_1}) : {NUM_FEATURES}×{HIDDEN_1} + {HIDDEN_1} = {p1:,}") print(f" Layer 2 (Linear {HIDDEN_1}→{HIDDEN_2}) : {HIDDEN_1}×{HIDDEN_2} + {HIDDEN_2} = {p2:,}") print(f" Output (Linear {HIDDEN_2}→{NUM_CLASSES}) : {HIDDEN_2}×{NUM_CLASSES} + {NUM_CLASSES} = {p3:,}") print(f" Total trainable parameters : {total_params:,}") print() # ---- Build the Model ---- # nn.Sequential runs each layer in the order listed, feeding the output of one # directly into the input of the next. model = nn.Sequential( nn.Linear(NUM_FEATURES, HIDDEN_1), # Layer 1: receives 28 features, outputs 64 values nn.ReLU(), # ReLU activation: max(0, x) — adds non-linearity nn.Linear(HIDDEN_1, HIDDEN_2), # Layer 2: receives 64 values, outputs 32 values nn.ReLU(), # Another ReLU between layers nn.Linear(HIDDEN_2, NUM_CLASSES), # Output layer: receives 32, outputs 3 raw scores # No activation here — CrossEntropyLoss handles it ) print("Model architecture:") print(model) Step 13 — Handling Class Imbalance From our EDA in Step 5.1, we know that Store shoppers make up ~87% of the dataset. 
Without any correction, the model quickly learns that predicting “Store” for every single customer achieves ~87% training accuracy with very little effort, while completely ignoring Online and Hybrid customers. That’s not a useful model.

Class weights fix this by telling the loss function to penalize errors on minority classes more heavily. The weight for each class is calculated as:

weight for class c = n_total / (n_classes × n_samples in class c)

The code below prints the resulting weights for our three classes. With these weights applied, the model can no longer afford to ignore Hybrid customers. Every missed Hybrid prediction contributes nearly 28 times more to the loss than a missed Store prediction, forcing the network to take small segments seriously.

The trade-off to keep in mind: higher weights on Hybrid mean we catch more true Hybrid customers (higher recall), but we’ll also mislabel some Store or Online customers as Hybrid (more false positives). We’ll measure exactly how this plays out in the evaluation steps.

# compute_class_weight calculates the balanced weight for each class.
# 'balanced' = use the formula: n_total / (n_classes * n_class_samples) # classes = the unique integer class labels in the training set # y = the full training label array (needed to count samples per class) class_weights_array = compute_class_weight( class_weight='balanced', classes=np.unique(y_train), y=y_train ) # We must convert to a float32 PyTorch tensor to pass it to CrossEntropyLoss class_weights_tensor = torch.tensor(class_weights_array, dtype=torch.float32) print("Class weights (higher = penalized more when missed):") for code, weight in enumerate(class_weights_tensor): class_name = le.classes_[code] n_train = (y_train == code).sum() print(f" Class {code} ({class_name:8s}): weight = {weight:.4f} " f" (training samples: {n_train:,})") print() print("Interpretation:") print(f" A Hybrid error is penalized ~{class_weights_tensor[0]/class_weights_tensor[2]:.1f}x more than a Store error.") print(f" An Online error is penalized ~{class_weights_tensor[1]/class_weights_tensor[2]:.1f}x more than a Store error.") Step 14 — Loss Function and Optimizer We define two components that control how the model learns: Loss function — measures how wrong the model’s predictions are Optimizer — uses that measurement to update the model’s weights Loss Function: CrossEntropyLoss nn.CrossEntropyLoss is the standard choice for multi-class classification. For each prediction, it: Applies Softmax to convert the 3 raw scores into probabilities Takes the negative log of the probability assigned to the correct class Averages the loss across the mini-batch, applying class weights from Step 13 The higher the model’s confidence in the wrong class, the higher the loss. The higher the confidence in the correct class, the lower the loss. Training pushes the model to be confidently right. Optimizer: Adam torch.optim.Adam uses an adaptive learning rate — parameters that have been updating frequently get smaller steps, while parameters that have barely moved get larger steps. 
This self-tuning behaviour makes Adam converge faster and more reliably than standard gradient descent. We use the default learning rate of lr=0.001, which is a solid starting point for most problems: Too large → training becomes unstable, loss jumps around Too small → training converges very slowly 0.001 → well-balanced for our architecture ✓ # --- Loss Function --- # weight=class_weights_tensor passes our computed class weights. # When a Hybrid sample (class 0) is misclassified, the loss contribution for that # sample is multiplied by class_weights_tensor[0] (the highest weight). criterion = nn.CrossEntropyLoss(weight=class_weights_tensor) # --- Learning Rate --- LR = 0.001 # Adam's default learning rate. Controls the size of each weight update step. # Too large: training becomes unstable (loss jumps around) # Too small: training converges very slowly # --- Optimizer --- # model.parameters() gives the optimizer a reference to ALL weights and biases # in the model (across all layers), so it can update them all during optimizer.step(). optimizer = torch.optim.Adam(model.parameters(), lr=LR) print(f"Loss function : {criterion}") print(f"Optimizer : Adam (lr = {LR})") print() print("Ready to train!") Step 15 — Training the Neural Network This is where everything we’ve built comes together. We run the training loop for 150 epochs — each epoch is one full pass through all 9,431 training customers across 148 mini-batches, producing one weight update per batch. That’s 22,200 weight updates in total. Every weight update has a cost — compute time, energy, and in production, money. For our 4,035-parameter model on a laptop, 150 epochs runs in minutes. But every architectural choice we’ve made — number of layers, neurons, batch size, epochs — directly determines that cost at scale. A production model serving millions of customers could take weeks and tens of thousands of dollars to train. 
Getting these decisions right from the start is what makes the difference between a model that is affordable to maintain and one that is not. Every mini-batch follows the same five-step cycle: We print the average loss every 15 epochs. If the loss stops decreasing well before epoch 150, it means the model has either converged early or got stuck — both cases worth knowing before evaluating results. NUM_EPOCHS = 150 # Number of full passes through the training data losses = [] # We collect the average loss per epoch to plot later for epoch in range(NUM_EPOCHS): # Outer loop: repeat NUM_EPOCHS times total_loss = 0.0 # Accumulate the loss across all batches in this epoch num_batches = 0 # Count how many batches we processed for X_batch, y_batch in train_loader: # Inner loop: iterate through mini-batches # ---- Step 1: Clear gradients ---- # By default, PyTorch ACCUMULATES gradients across calls to .backward(). # We must reset them at the start of each batch so we only use the current # batch's gradients for the weight update. optimizer.zero_grad() # ---- Step 2: Forward pass ---- # Pass the 64-sample batch through the entire network: # X_batch (64 x 28) → Linear → ReLU → Linear → ReLU → Linear → y_pred (64 x 3) # Each of the 64 rows gets three raw class scores. y_pred = model(X_batch) # ---- Step 3: Compute loss ---- # criterion is CrossEntropyLoss(weight=class_weights_tensor) # y_pred : shape (64, 3) — model's raw scores for each of 3 classes # y_batch: shape (64,) — true class indices (0, 1, or 2) # The function applies Softmax then computes weighted negative log-likelihood. loss = criterion(y_pred, y_batch) # ---- Step 4: Backward pass ---- # Compute the gradient of 'loss' with respect to every learnable parameter in the model. # PyTorch builds a computational graph during the forward pass and uses it here # to apply the chain rule automatically — this is the power of autograd. 
        loss.backward()

        # ---- Step 5: Update weights ----
        # The Adam optimizer uses the computed gradients to nudge every weight and bias
        # in the direction that reduces the loss.
        optimizer.step()

        total_loss += loss.item()  # .item() extracts the Python float from the tensor
        num_batches += 1

    # Average loss for this epoch (average over all mini-batches)
    avg_loss = total_loss / num_batches
    losses.append(avg_loss)

    # Print a progress update every 15 epochs
    if (epoch + 1) % 15 == 0:
        print(f"Epoch [{epoch + 1:3d}/{NUM_EPOCHS}] Avg Loss: {avg_loss:.4f}")

print()
print(f"Training complete! Final average loss: {losses[-1]:.4f}")

Step 16 — Training Loss Curve

The loss curve is the first thing we check after training. It tells us whether the model actually learned anything, before we look at any other metric.

What our training run shows:
Loss starts at 0.8869, the first recorded value, when the weights are still close to their random initialization
Drops sharply to 0.0893 by epoch 15, as the model picks up the main patterns quickly
Continues dropping steadily to 0.0004 by epoch 105
A spike to 0.0128 appears at epoch 120, then recovers and settles at 0.0002 by epoch 150

What is the spike at epoch 120?
Occasional spikes like this are common with the Adam optimizer late in training. When the loss is already very small, the gradient signals become very weak and noisy. Adam’s adaptive step sizes can then overshoot the minimum slightly, causing a brief jump in loss. Here the optimizer corrects itself within a few epochs, so there is nothing to worry about.

A note on overfitting: Overfitting happens when the training loss keeps dropping but the model stops generalising to new data. We’re only tracking training loss here, but the fact that our test accuracy reaches 97.07% suggests we’re not severely overfitting. For a more complete check, we’d want to also track validation loss during training, using a separate held-out set that gives an early warning if training loss and validation loss start to diverge.
This is worth exploring as a next step after finishing this notebook. plt.figure(figsize=(10, 5)) # Plot the average loss collected after each epoch # x-axis: epoch number (1 to 150) # y-axis: average cross-entropy loss sns.lineplot( x=range(1, NUM_EPOCHS + 1), # epoch indices starting from 1 y=losses, color='steelblue', linewidth=2 ) # Mark the start and end loss values plt.scatter([1], [losses[0]], color='red', s=80, zorder=5, label=f'Start loss: {losses[0]:.3f}') plt.scatter([NUM_EPOCHS], [losses[-1]], color='green', s=80, zorder=5, label=f'End loss: {losses[-1]:.3f}') plt.xlabel('Epoch', fontsize=12) plt.ylabel('Average Cross-Entropy Loss', fontsize=12) plt.title('Training Loss Over Epochs\n' 'Should decrease and flatten out — confirming the model is learning', fontsize=13) plt.legend(fontsize=11) plt.tight_layout() plt.show() print(f"Loss reduction: {losses[0]:.4f} --> {losses[-1]:.4f} " f"(improved by {(losses[0]-losses[-1])/losses[0]*100:.1f}%)") Step 17 — Generate Predictions on the Test Set Training is done. We now run the model on 2,358 test customers — customers it has never seen during training — and generate a predicted shopping segment for each one. Each customer gets 3 raw scores (one per segment). We take the highest score as the predicted class. Everything we measure in Steps 18–22 is based on these predictions. Two important practices to follow when generating predictions: model.eval() — Switches the model from training mode to scoring mode. Always call this before running predictions — it's the standard practice and prevents subtle bugs if the architecture is ever updated with layers that behave differently during training vs. inference (e.g. Dropout, BatchNorm). torch.no_grad() — Tells PyTorch not to track gradients during this step. We're predicting, not updating weights, so gradient tracking is unnecessary. This reduces memory usage and speeds up inference. 
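The "take the highest score" step can be sketched without PyTorch. The snippet below mimics a row-wise maximum on a small made-up score table (three customers, three classes) and decodes the winning indices into segment names. The helper name row_max and the scores are illustrative assumptions; the class order follows the label encoding used in this notebook.

```python
CLASS_NAMES = ["Hybrid", "Online", "Store"]   # order produced by the label encoder

# Made-up raw scores: one row per customer, one column per class
logits = [
    [0.2, 1.9, -0.7],
    [-1.1, 0.3, 2.4],
    [1.5, 1.2, -0.2],
]

def row_max(rows):
    """Return (values, indices) of the largest entry in each row."""
    values, indices = [], []
    for row in rows:
        best = max(range(len(row)), key=lambda j: row[j])
        values.append(row[best])
        indices.append(best)
    return values, indices

_, predicted = row_max(logits)                 # keep only the winning class indices
decoded = [CLASS_NAMES[i] for i in predicted]

print(predicted)
print(decoded)
```

Because Softmax preserves the ordering of the scores, picking the largest raw score always selects the same class as picking the largest probability, which is why no activation is needed at prediction time.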
model.eval() # Switch model to evaluation mode (best practice before inference) with torch.no_grad(): # Disable gradient tracking — we are predicting, not training # Forward pass on the test tensor # y_test_logits shape: (2358, 3) — 2358 test samples, each with 3 class scores y_test_logits = model(X_test_tensor) # torch.max(tensor, dim=1) finds the maximum along dimension 1 (across the 3 class scores) # It returns: # values = the actual highest score (we store in _ because we do not need it) # indices = the INDEX of the highest score = our predicted class (0, 1, or 2) _, y_pred_tensor = torch.max(y_test_logits, dim=1) # Convert from PyTorch tensor to NumPy array for use with sklearn metrics y_pred = y_pred_tensor.numpy() # Let us inspect a sample of predictions vs. true labels print("Sample of predictions vs. true labels (first 20 test samples):") print() print(f" Predicted (int): {y_pred[:20].tolist()}") print(f" True (int): {y_test[:20].tolist()}") print() print(" Decoded predictions:") for pred, true in zip(y_pred[:10], y_test[:10]): match = 'CORRECT' if pred == true else 'WRONG ' print(f" Predicted: {le.classes_[pred]:8s} | True: {le.classes_[true]:8s} | {match}") Step 18 — Evaluation Metric: Overall Accuracy Accuracy is the most intuitive summary of model performance: Every stakeholder report leads with accuracy because it is immediately interpretable: “our model correctly classifies 97.07% of customers.” That is a clean, compelling headline. But accuracy alone is insufficient for our business problem — and here is exactly why: With Store representing ~87% of customers, a model that predicts “Store” for every single customer — using zero data, no model, no intelligence whatsoever — achieves 86.9% accuracy . Our neural network achieves 97.07% — only 10.17 percentage points better on the headline. At first glance, 10.17 pp of improvement might not sound like much. But this framing misses the entire story. 
The baseline’s 86.9% is built entirely on correctly guessing Store. Our model’s 97.07% includes meaningful recall for Online (87.7%) and Hybrid (73.0%), segments that the baseline completely ignores. The difference between 86.9% and 97.07% is not just 10.17 pp of accuracy. It is the difference between a system that can only serve one customer segment and one that can serve all three.

This is why we never stop at accuracy for imbalanced business problems. The next three metrics tell the complete and honest story.

accuracy = accuracy_score(y_test, y_pred)
# y_test: true integer labels (from our test split)
# y_pred: predicted integer labels (from our model)
# Returns a float between 0.0 (0% correct) and 1.0 (100% correct)

# Count matches directly; int(accuracy * len(y_test)) can truncate to one too few
# because of floating-point rounding.
n_correct = int((y_pred == y_test).sum())
n_total = len(y_test)

print(f"Overall Accuracy : {accuracy * 100:.2f}%")
print(f"Correct : {n_correct:,} out of {n_total:,} test samples")
print()
print("Per-class correct predictions:")
for code in range(NUM_CLASSES):
    mask = y_test == code                           # Boolean mask: True for samples of this class
    n_class = mask.sum()                            # Total test samples in this class
    n_class_correct = (y_pred[mask] == code).sum()  # Correct predictions for this class
    class_acc = n_class_correct / n_class * 100
    print(f" {le.classes_[code]:8s} : {n_class_correct:4d} / {n_class:4d} correct "
          f"({class_acc:.1f}%)")

Step 19 — Confusion Matrix

Overall accuracy tells us how often the model is right. The confusion matrix tells us what kind of mistakes it makes, which is far more useful for making business decisions.

How to read it: Each row is the true class; each column is the predicted class. The diagonal cells are correct predictions. Every off-diagonal cell is an error, showing which class was mistaken for which.

Finding 1 — Store is predicted with near-perfect accuracy
With 2,049 true Store customers in the test set, the model correctly identifies 2,029 of them (99%). Only 20 Store customers were misclassified: 19 as Hybrid and 1 as Online.
The dominant class is handled almost flawlessly.

Finding 2 — Online is well-identified, with errors leaning toward Hybrid
Out of 235 true Online customers, 206 were correctly predicted (88%). The remaining 29 errors landed on Hybrid (28) and Store (1), so almost no Online customer was confused with Store directly.

Finding 3 — Hybrid is the hardest segment, but still reasonably identified
Hybrid is the smallest and most difficult class. Out of 74 true Hybrid customers, 54 were correctly predicted (73%). The 20 misses were split as: 15 predicted as Store and 5 predicted as Online. Missed Hybrid customers lean toward Store rather than Online, consistent with Store being the dominant class with the most training examples.

Why do errors cluster around Hybrid?
Hybrid shoppers use both channels by definition, so they naturally share behavioral features with both Online and Store shoppers. The model boundary around Hybrid is inherently blurry. That is not a flaw in the model, but a reflection of how this customer segment actually behaves.

The relatively low Hybrid precision (54 correct out of 101 Hybrid predictions) is also a direct result of the class weights we set in Step 13. By heavily penalising missed Hybrid predictions, the model becomes more aggressive about predicting Hybrid, catching more true Hybrid customers but also pulling in some Online (28) and Store (19) customers incorrectly. Whether this trade-off is acceptable depends on the cost of running omnichannel campaigns on the wrong customers, which is a business decision rather than a modelling one.
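The counts quoted in the three findings are enough to rebuild the full 3×3 confusion matrix by hand and derive the headline metrics from it. The sketch below does that in plain Python; the matrix is reconstructed from the numbers in the text, not taken from a fresh model run, and the variable names are ours.

```python
CLASS_NAMES = ["Hybrid", "Online", "Store"]

# Rows = true class, columns = predicted class (counts quoted in the findings)
cm = [
    [54, 5, 15],      # true Hybrid
    [28, 206, 1],     # true Online
    [19, 1, 2029],    # true Store
]

n = len(CLASS_NAMES)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total
baseline = max(sum(row) for row in cm) / total   # always predict the largest class

metrics = {}
for i, name in enumerate(CLASS_NAMES):
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                          # true class i, predicted as something else
    fp = sum(cm[r][i] for r in range(n)) - tp     # predicted class i, actually something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (precision, recall, f1)
    print(f"{name:8s} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

print(f"Accuracy: {accuracy * 100:.2f}%   Majority baseline: {baseline * 100:.1f}%")
```

Reading down a column gives precision (how trustworthy a prediction is), while reading across a row gives recall (how many true members of a segment we catch). The Hybrid column makes the cost of the class weights visible: 47 of the 101 Hybrid predictions belong to other segments.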
cm = confusion_matrix(y_test, y_pred)
# y_test : true labels
# y_pred : predicted labels
# Returns a 3x3 matrix; cm[i][j] = number of samples with true class i predicted as class j

class_names = le.classes_  # ['Hybrid', 'Online', 'Store']

# ---- One figure, two panels: raw counts (left) and row-normalized (right) ----
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: raw counts
sns.heatmap(
    cm,
    annot=True,     # Write the count number inside each cell
    fmt='d',        # Format as integers (not scientific notation)
    cmap='Blues',   # Blue color scale: darker = higher count
    xticklabels=class_names,
    yticklabels=class_names,
    linewidths=1,
    linecolor='white',
    ax=axes[0],
    cbar=True
)
axes[0].set_title('Confusion Matrix (Raw Counts)', fontsize=12, fontweight='bold', pad=12)
axes[0].set_xlabel('Predicted Label', fontsize=11)
axes[0].set_ylabel('True Label', fontsize=11)

# ---- Normalized confusion matrix (row-normalized = recall per class) ----
# Dividing each row by its sum gives the recall for each class (fraction correctly identified)
cm_normalized = cm.astype('float') / cm.sum(axis=1, keepdims=True)
sns.heatmap(
    cm_normalized,
    annot=True,
    fmt='.2f',      # Format as decimal with 2 places
    cmap='Greens',
    xticklabels=class_names,
    yticklabels=class_names,
    linewidths=1,
    linecolor='white',
    ax=axes[1],
    vmin=0,
    vmax=1
)
axes[1].set_title('Confusion Matrix (Row-Normalized = Recall per Class)', fontsize=12, fontweight='bold', pad=12)
axes[1].set_xlabel('Predicted Label', fontsize=11)
axes[1].set_ylabel('True Label', fontsize=11)

plt.suptitle('Confusion Matrix Analysis', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Print interpretation
print("How to read the confusion matrix:")
print(" - Diagonal (top-left to bottom-right) = correct predictions")
print(" - Off-diagonal cells = mistakes (true class vs. predicted class)")
print()
for i, true_class in enumerate(class_names):
    for j, pred_class in enumerate(class_names):
        if i != j and cm[i, j] > 0:
            print(f" {cm[i,j]:4d} {true_class:8s} samples were incorrectly predicted as {pred_class}")

Step 20 — Classification Report

The classification report breaks down performance per segment, giving us four metrics for each class: precision, recall, F1-score, and support (the number of true test samples in that class).

Reading Our Results Through a Business Lens

Hybrid: Precision 0.535 / Recall 0.730 / F1 0.617

Our smallest and most strategically important segment. We correctly catch 73% of true Hybrid customers, a meaningful result given that the baseline catches 0%. However, precision is only 0.535, meaning roughly 1 in every 2 customers we flag as Hybrid is actually from another segment. This is the direct result of the high class weight we set in Step 13: the model casts a wide net to catch more Hybrid customers, but also pulls in some false positives. The right call here depends on campaign cost:

If the Hybrid campaign is high-cost (personalised, high-touch), a 53.5% precision may be too wasteful → consider lowering the class weight to trade some recall for cleaner predictions
If the Hybrid campaign is low-cost (email series, digital nudge), catching 73% of a high-value segment likely justifies the false positives → keep or raise the weight for even better recall

Online: Precision 0.972 / Recall 0.877 / F1 0.922

Strong, reliable performance. When we predict Online, we are correct 97.2% of the time, so very little digital campaign spend is wasted. We catch 87.7% of all true Online customers, with most misses falling into Hybrid rather than Store. This segment is well-covered and ready for targeted digital channel deployment.

Store: Precision 0.992 / Recall 0.990 / F1 0.991

Near-perfect. The model identifies Store shoppers with 99% precision and recall. With the largest number of training examples, the Store customer profile is learned in exceptional detail.
The dominant revenue base is extremely well covered.

Macro vs. Weighted Average: Which One to Report?

Always report macro F1 when minority segments matter. The weighted average of 0.973 sounds impressive, but it is largely driven by Store’s near-perfect scores. A stakeholder seeing only “F1 = 0.97” might approve a deployment that actually delivers F1 = 0.617 for Hybrid customers. Macro F1 = 0.843 is the honest answer to “How well does this model serve all three customer groups?”, and that is the number business decisions should be based on.

report = classification_report(
    y_test, y_pred,
    target_names=class_names,  # Use 'Hybrid', 'Online', 'Store' instead of 0, 1, 2
    digits=3                   # Show 3 decimal places
)
print("=== Classification Report ===")
print(report)

# Let us also visualize precision, recall, and F1 per class as a grouped bar chart
# so we can compare the three classes side by side.
report_dict = classification_report(y_test, y_pred, target_names=class_names, output_dict=True)
# output_dict=True returns the same information as a Python dictionary instead of a string

# Extract per-class metrics (exclude 'accuracy', 'macro avg', 'weighted avg' rows)
metrics_data = []
for class_name in class_names:
    for metric in ['precision', 'recall', 'f1-score']:
        metrics_data.append({
            'Class': class_name,
            'Metric': metric.replace('-', '\n'),
            'Value': report_dict[class_name][metric]
        })
metrics_df = pd.DataFrame(metrics_data)

# Pivot to wide format for grouped bar chart
pivot_df = metrics_df.pivot(index='Metric', columns='Class', values='Value')

fig, ax = plt.subplots(figsize=(11, 5))
pivot_df.plot(
    kind='bar',
    ax=ax,
    color=['#2ecc71', '#3498db', '#e74c3c'],  # Green=Hybrid, Blue=Online, Red=Store
    edgecolor='black',
    linewidth=0.5,
    width=0.7
)

# Add value labels on each bar
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', fontsize=8, padding=2)

ax.set_title('Precision, Recall, and F1-Score per Class', fontsize=13, fontweight='bold')
ax.set_xlabel('Metric', fontsize=11)
ax.set_ylabel('Score (0 to 1)', fontsize=11)
ax.set_ylim(0, 1.15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, fontsize=10)
# Draw the reference line before building the legend so its label is included
ax.axhline(y=0.5, color='gray', linestyle='--', linewidth=0.8, alpha=0.6, label='0.50 reference line')
ax.legend(title='Shopping Preference', fontsize=9)
plt.tight_layout()
plt.show()

Step 21 — Baseline Comparison: The True Business Value of Our Model

Before presenting results to stakeholders or requesting deployment budget, we need to answer one fundamental question: did we actually build something valuable, or just something that looks good on a benchmark?

The answer requires comparing against the simplest possible baseline: a naïve classifier that always predicts the most common segment, without looking at any features or doing any computation at all. Since Store represents ~87% of the data, this baseline achieves:

Accuracy: 86.90%, by predicting “Store” for every single customer
Macro F1: 0.310, despite the high accuracy
Hybrid recall: 0%, because it never identifies a single Hybrid customer
Online recall: 0%, because it never identifies a single Online customer

This is the cost-of-doing-nothing benchmark. Every dollar spent on data pipelines, model development, compute, and ongoing maintenance is only justified if our model meaningfully outperforms this zero-intelligence baseline.

The measurable business value we actually deliver:

Accuracy: from 86.90% to 97.07% (+10.17 pp)
Macro F1: from 0.310 to 0.843 (+0.533)
Online recall: from 0% to 87.7%
Hybrid recall: from 0% to 73.0%

The macro F1 improvement of +0.533 is the clearest measure of how much better we now serve all three customer segments compared to doing nothing.

The ROI framing: If the effort to build and maintain this model costs X dollars, the business question is whether identifying 87.7% of Online customers and 73.0% of Hybrid customers, groups the baseline treats as completely invisible, generates more than X dollars in incremental campaign efficiency and customer lifetime value. That calculation requires input from marketing, finance, and customer success.
The model’s job is to produce the best predictions possible. This step checks that it clearly beats the do-nothing baseline.

from collections import Counter        # harmless if already imported in Step 1
from sklearn.metrics import f1_score

# ---- Majority-class baseline ----
# Find the most common class in the TEST set
test_label_counts = Counter(y_test)
# most_common(1) returns [(label, count)] for the single most frequent label
most_common_class, most_common_count = test_label_counts.most_common(1)[0]
naive_accuracy = most_common_count / len(y_test)

# Build a naive prediction array: always predict the most common class
y_pred_naive = np.full(len(y_test), fill_value=most_common_class)
# np.full creates an array of the same value repeated len(y_test) times

print("=== Test Set Class Counts ===")
for code, count in sorted(test_label_counts.items()):
    print(f" {le.classes_[code]:8s} : {count:4d} ({count/len(y_test)*100:.1f}%)")
print()
print("=== Naive (Always-Store) Classifier ===")
print(f" Always predicts: '{le.classes_[most_common_class]}'")
print(f" Naive Accuracy : {naive_accuracy * 100:.2f}%")
print()
print("=== Our ANN Classifier ===")
print(f" Overall Accuracy : {accuracy * 100:.2f}%")
print()

# ---- Side-by-side per-class performance ----
print("=== Per-Class Recall Comparison ===")
print(f"{'Class':10s} {'Naive Recall':>14s} {'ANN Recall':>12s} {'Improvement':>12s}")
print("-" * 54)
for code in range(NUM_CLASSES):
    mask = (y_test == code)
    # Naive recall: fraction of true-class samples that the naive model predicts as that class
    naive_recall = (y_pred_naive[mask] == code).sum() / mask.sum()
    # ANN recall: the same fraction for our model's predictions
    ann_recall = (y_pred[mask] == code).sum() / mask.sum()
    improvement = ann_recall - naive_recall
    print(f" {le.classes_[code]:8s} {naive_recall*100:>12.1f}% "
          f"{ann_recall*100:>10.1f}% {improvement*100:>+10.1f}%")
print()
print("=== Macro F1-Score Comparison ===")
naive_macro_f1 = f1_score(y_test, y_pred_naive, average='macro')
ann_macro_f1 = f1_score(y_test, y_pred, average='macro')
print(f" Naive macro F1 : {naive_macro_f1:.4f}")
print(f" ANN macro F1 : {ann_macro_f1:.4f}")
print(f" Improvement : +{ann_macro_f1 - naive_macro_f1:.4f}")

Step 22 — ROC Curve and AUC: Measuring Flexibility Across Every Threshold

Every metric we’ve looked at so far was measured at one fixed operating point: predict whichever class has the highest probability. In practice, the right operating point for a business is rarely the default one.

Consider this: we want to send a premium, high-cost Hybrid loyalty package to every customer we identify as Hybrid. At the default operating point, we catch 73.0% of true Hybrid customers, but we also flag 47 non-Hybrid customers in the test set (28 Online and 19 Store). If each wrongly targeted customer costs $50 in wasted fulfillment, that is $2,350 of waste on a 2,358-customer sample, which is a real business problem.

What if we raised the threshold? Only predict Hybrid when P(Hybrid) > 0.60 instead of “whichever probability is highest.” We’d catch fewer Hybrid customers overall, but the ones we flag would be more reliably Hybrid, and campaign ROI improves. Conversely, for a cheap email campaign, we might lower the threshold and cast a wider net, catching more true Hybrid customers at the cost of some non-Hybrid contacts.

The ROC curve maps out every possible threshold choice simultaneously, showing us exactly what recall and false-alarm rate we get at any operating point we choose.

What Is a ROC Curve?

ROC stands for Receiver Operating Characteristic. As we sweep the threshold from 1.0 to 0.0:

Y-axis, True Positive Rate (TPR / Recall): Of all actual customers in this segment, what fraction do we correctly identify?
X-axis, False Positive Rate (FPR): Of all customers who are not in this segment, what fraction do we incorrectly flag?

The Area Under the Curve (AUC) summarizes the model’s discriminative power across all thresholds in a single number: 1.0 means perfect separation, and 0.50 means no better than random guessing.

The critical advantage of AUC over accuracy: AUC cannot be inflated by leaning on the majority class. A model that gives every customer the same scores, for example by always predicting “Store”, scores AUC = 0.50 on every class no matter how high its overall accuracy.
AUC rewards genuine probabilistic separation, not majority-class exploitation.

One-vs-Rest (OvR): Making Multi-Class ROC Work

ROC was designed for binary problems. For our three segments, we run three independent binary analyses: Hybrid vs. Rest, Online vs. Rest, and Store vs. Rest. For each analysis, we use the model’s probability score for that class, not the argmax prediction. This gives us a continuous signal we can threshold at any point.

from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.preprocessing import label_binarize
# We import these here because they were not needed in Step 1's import block.
# In practice, it is perfectly fine to import at the point of use.

# ---- Step 1: Obtain Softmax Probabilities ----
# Until now we only kept the argmax prediction (the winning class).
# For ROC we need the full 3-number probability vector for every test sample.
model.eval()
with torch.no_grad():
    y_test_logits = model(X_test_tensor)  # shape: (2358, 3), raw logit scores
    # .cpu() is a no-op on CPU and keeps this line working if the model runs on a GPU
    y_prob = torch.softmax(y_test_logits, dim=1).cpu().numpy()
# torch.softmax(x, dim=1): for each row, converts 3 raw scores into 3 probabilities
# that are all positive and sum to 1.0 (up to floating-point rounding).
# y_prob[:, 0] = P(Hybrid | features)
# y_prob[:, 1] = P(Online | features)
# y_prob[:, 2] = P(Store | features)

print(f"y_prob shape : {y_prob.shape} — {y_prob.shape[0]} test samples x 3 class probabilities")
print(f"Row-sum check: all rows sum to 1.0? {np.allclose(y_prob.sum(axis=1), 1.0)}")
print()

# Show a sample of the probability outputs alongside predictions and true labels
print(f" {'#':>4} {'P(Hybrid)':>10} {'P(Online)':>10} {'P(Store)':>10} {'Predicted':>10} {'True':>10} Match")
print(f" {'-'*78}")

# Show 3 Store, 3 Online, and 2 Hybrid samples (if available) for variety
shown = {0: 0, 1: 0, 2: 0}
for j in range(len(y_test)):
    true_class = int(y_test[j])
    if shown[true_class] >= (2 if true_class == 0 else 3):
        continue
    pred = le.classes_[y_pred[j]]
    true = le.classes_[true_class]
    match = 'OK' if y_pred[j] == y_test[j] else 'WRONG'
    print(f" {j:4d} {y_prob[j,0]:10.4f} {y_prob[j,1]:10.4f} {y_prob[j,2]:10.4f} {pred:>10} {true:>10} {match}")
    shown[true_class] += 1
    if all(v >= (2 if k == 0 else 3) for k, v in shown.items()):
        break

# ---- Step 2: Binarize Labels for One-vs-Rest ----
# label_binarize turns each integer label into a binary indicator vector.
# label 0 (Hybrid) → [1, 0, 0]
# label 1 (Online) → [0, 1, 0]
# label 2 (Store) → [0, 0, 1]
# This lets roc_curve treat each class independently as a binary problem.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])  # shape: (2358, 3)
print()
print(f"y_test_bin shape: {y_test_bin.shape}")
print("Example binarization:")
for ex_label in [0, 1, 2]:
    idx = np.where(y_test == ex_label)[0][0]  # first test sample of this class
    print(f" y_test[{idx}] = {ex_label} ({le.classes_[ex_label]:8s}) → y_test_bin[{idx}] = {y_test_bin[idx]}")

# ---- Step 3: Compute and Plot ROC Curves ----
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = ['#2ecc71', '#3498db', '#e74c3c']  # Green=Hybrid, Blue=Online, Red=Store
roc_aucs = []
for i, (class_name, color) in enumerate(zip(le.classes_, colors)):
    # roc_curve() sweeps the threshold from 1.0 to 0.0 and records (FPR, TPR) at each step.
    # y_test_bin[:, i] = true binary label for this class (1 if belongs here, else 0)
    # y_prob[:, i] = model's predicted probability for this class
    fpr, tpr, thresholds = roc_curve(y_test_bin[:, i], y_prob[:, i])

    # auc() computes the area under the (fpr, tpr) curve via the trapezoidal rule.
    roc_auc = auc(fpr, tpr)
    roc_aucs.append(roc_auc)

    # Draw the ROC curve
    axes[i].plot(fpr, tpr, color=color, linewidth=2.5, label=f'AUC = {roc_auc:.4f}')

    # Diagonal dashed line = random classifier (AUC = 0.50).
    # Our curve should always lie clearly above this.
    axes[i].plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random classifier (AUC = 0.50)')

    # Shade the area under the curve so AUC is visually intuitive
    axes[i].fill_between(fpr, tpr, alpha=0.12, color=color)

    # Mark the operating point of the argmax rule as a dot.
    # Its FPR and TPR match the confusion matrix. Argmax is not a single threshold
    # on this class's probability, so the dot can sit slightly off the curve.
    is_class = (y_test == i)
    pred_as_class = (y_pred == i)
    tpr_op = (pred_as_class & is_class).sum() / is_class.sum()
    fpr_op = (pred_as_class & ~is_class).sum() / (~is_class).sum()
    axes[i].scatter(fpr_op, tpr_op, color='black', zorder=5, label='Argmax operating point')

    support = is_class.sum()
    axes[i].set_title(
        f'{class_name} vs. Rest (test support: {support} samples)',
        fontsize=11, fontweight='bold')
    axes[i].set_xlabel('False Positive Rate (FPR)', fontsize=10)
    axes[i].set_ylabel('True Positive Rate / Recall (TPR)', fontsize=10)
    axes[i].set_xlim([0.0, 1.0])
    axes[i].set_ylim([0.0, 1.05])
    axes[i].legend(loc='lower right', fontsize=10)
    axes[i].grid(True, alpha=0.3)

plt.suptitle('ROC Curves — One-vs-Rest (OvR)\n'
             'Each plot asks: can the model separate this class from all others?',
             fontsize=13, fontweight='bold', y=1.03)
plt.tight_layout()
plt.show()

# ---- Step 4: AUC Summary ----
macro_auc = roc_auc_score(y_test_bin, y_prob, average='macro')
# average='macro' = simple average of the three per-class AUCs (equal weight per class)

print("=== AUC Summary ===")
print()
for code, (class_name, roc_auc_val) in enumerate(zip(le.classes_, roc_aucs)):
    support = (y_test == code).sum()
    bar = '#' * int(roc_auc_val * 50)
    print(f" {class_name:8s} (n={support:4d}) : AUC = {roc_auc_val:.4f} {bar}")
print()
print(f" Macro-averaged AUC : {macro_auc:.4f}")
print()
print("Interpretation:")
print(" >= 0.90 : Excellent | 0.80-0.89 : Good | 0.70-0.79 : Acceptable")
print(" 0.60-0.69 : Poor | <= 0.50 : Random")

What the AUC Scores Tell Us, and What Business Actions They Unlock

Hybrid: AUC = 0.9844 (Excellent)

This is the most strategically important finding in the entire evaluation. Our classification report showed Hybrid F1 = 0.617, a number that might make a stakeholder question whether the model is useful for Hybrid targeting at all. The AUC of 0.9844 tells a completely different story.

An AUC of 0.9844 means the model’s Hybrid probability scores are extremely well-ordered: a randomly selected true Hybrid customer has a 98.44% chance of receiving a higher P(Hybrid) score than a randomly selected non-Hybrid customer. The model has outstanding discriminative power for Hybrid. The default argmax rule simply doesn’t fully exploit that power for Hybrid alone.

Practical translation: the Hybrid threshold can be tuned to the campaign. Raise it when each false positive is expensive, and lower it when missing a Hybrid customer costs more than contacting a non-Hybrid one.

Online: AUC = 0.9987 (Near-Perfect)

Online shoppers have the most distinctive behavioral profiles (high monthly_online_orders, high online_payment_trust_score, high tech_savvy_score), making them the most cleanly separable segment despite representing only 10% of the data. Digital campaigns targeting predicted Online customers will land with very high reliability at any reasonable threshold.

Store: AUC = 0.9983 (Near-Perfect)

With the most training examples, the model separates Store customers from all others very reliably across thresholds. Store-targeted campaigns can be deployed with high confidence.

Macro-averaged AUC = 0.9938

Our imbalance-resistant measure of overall model quality sits firmly in the “Excellent” range. Compare this to the naïve baseline AUC of 0.50 on every class: the gap between 0.50 and 0.9938 is the concrete, threshold-independent measure of the value our model delivers over doing nothing.
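The threshold idea for Hybrid can be made concrete with a small sweep: try each candidate cut-off and keep the lowest one that still meets a target precision. This is a minimal sketch on made-up toy scores, not the model's outputs, and both function names are hypothetical. In the notebook, the natural inputs would be `y_prob[:, 0]` and `y_test_bin[:, 0]`, and scikit-learn's `precision_recall_curve` offers a vectorized route to the same numbers.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every sample with score >= threshold is flagged."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    true_positives = sum(flagged)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / sum(labels)
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target_precision):
    """Lowest observed score that meets the precision target, or None if none does."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(scores, labels, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None


# Toy P(Hybrid) scores and one-vs-rest labels (1 = truly Hybrid). Illustrative only.
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20]
toy_labels = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

result = lowest_threshold_for_precision(toy_scores, toy_labels, target_precision=0.80)
if result is not None:
    threshold, precision, recall = result
    print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f}")
```

Taking the lowest qualifying threshold keeps recall as high as the precision target allows. On this toy data the answer is 0.65, which keeps 5 of the 6 true positives; a high-cost campaign would raise `target_precision`, and a cheap one would lower it.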
Summary: What We Built, What We Found, and What It Means for Business

Model Performance at a Glance

Segment   Precision   Recall   F1      AUC (OvR)
Hybrid    0.535       0.730    0.617   0.9844
Online    0.972       0.877    0.922   0.9987
Store     0.992       0.990    0.991   0.9983

Overall accuracy: 97.07% (baseline 86.90%). Macro F1: 0.843 (baseline 0.310). Weighted F1: 0.973. Macro-averaged AUC: 0.9938.

Key Business Findings

1. Minority segments are now identifiable, and that is the core business value. Before this model: 0% recall for both Online and Hybrid customers. After this model: 87.7% recall for Online, 73.0% for Hybrid. These segments, previously invisible to our channel strategy, are now actionable.

2. Class weighting is a business decision, not just a technical one. The penalty ratios are 27.8× for Hybrid and 8.7× for Online relative to Store. These ratios produced Hybrid recall = 73.0% and precision = 53.5%. The right values depend on the relative cost of a missed Hybrid customer vs. a wrong-channel campaign sent to a Store customer. The model gives us the output; business judgment determines the right operating point.

3. The confusion pattern reflects real behavioral overlap, not model failure. Both Online (28 cases) and Store (19 cases) errors land on Hybrid, not on each other. Of Hybrid’s own 20 misses, 15 are predicted as Store and 5 as Online. Only 2 total errors exist between Online and Store directly, confirming those two extreme profiles are cleanly separated.

4. AUC reveals that Hybrid is a much stronger segment than F1 alone suggests. Hybrid F1 = 0.617 is modest; Hybrid AUC = 0.9844 is excellent. A true Hybrid customer has a 98.44% chance of receiving a higher P(Hybrid) score than a non-Hybrid customer. The argmax rule is only one operating point: lowering the Hybrid threshold raises recall for lower-cost campaigns, and raising it improves precision for high-cost ones. That is a lever the F1 score alone would never reveal.

5. Training cost scales with architecture size, so start right-sized. Our 4,035-parameter model trains in minutes and delivers strong results. A larger architecture typically needs more training data, longer training time, and higher inference cost. Always measure the performance gain before scaling up model complexity.
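Finding 2 quotes the class weights as ratios relative to Store, while the Hybrid weight is also quoted elsewhere as an absolute 10.66. The two views are linked by simple arithmetic if the Step 13 weights follow the common inverse-frequency ("balanced") formula, weight_c = N / (K × n_c). That formula is an assumption here, since Step 13 is not reproduced in this part of the tutorial. The sketch also uses the test-set class counts as a stand-in for the training counts, so it only approximates the quoted figures.

```python
# Test-set class counts from the confusion matrix section, used as a proxy for
# the training counts that a real class-weight calculation would use.
class_counts = {"Hybrid": 74, "Online": 235, "Store": 2049}

n_total = sum(class_counts.values())
n_classes = len(class_counts)

# Assumed formula: inverse-frequency "balanced" weights, N / (K * n_c).
balanced_weights = {
    name: n_total / (n_classes * count) for name, count in class_counts.items()
}

# Ratio of each weight to the Store weight. N and K cancel, leaving n_Store / n_c.
ratio_to_store = {
    name: weight / balanced_weights["Store"] for name, weight in balanced_weights.items()
}

for name in class_counts:
    print(f"{name:8s} weight={balanced_weights[name]:6.3f} "
          f"ratio_to_store={ratio_to_store[name]:5.2f}x")
```

On these counts the Hybrid weight comes out at about 10.6 and its ratio to Store at about 27.7, close to the 10.66 and 27.8× quoted in the text. A small gap is expected because the tutorial's weights come from the training split.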
Complete Pipeline Reference

Things to Experiment With

We have a strong working baseline. The natural next question is: can we do better, and is the improvement worth the effort? Here is an honest assessment of each improvement path: what it gains, what it costs, and when it makes sense to pursue it.

Architecture Changes

The key rule: More capacity only helps if we have enough data to fill it. 4,035 parameters on 9,431 customers is already a healthy ratio. Doubling the parameters on the same data risks producing a model that memorizes training examples rather than learning generalizable patterns, and that delivers worse production performance than our current model.

Training Changes

More epochs (NUM_EPOCHS = 300): Only useful if the loss is still declining at epoch 150. Ours has essentially flattened, so more epochs would waste compute without improving the model. Always check the loss curve before increasing epochs.

Learning rate (lr = 0.0001 or lr = 0.01): The most sensitive hyperparameter. Too high: training oscillates or diverges. Too low: training converges very slowly and may settle at a suboptimal point. Always plot the loss curve after any learning rate change; it will show quickly whether the adjustment helped.

Batch size (batch_size = 32 or 128): Larger batches produce smoother gradients and fewer updates per epoch. Smaller batches add more noise, which sometimes helps escape poor local minima. Try both and compare the loss curve shape and final evaluation metrics.

Data Strategy: Addressing the Root Cause

SMOTE (Synthetic Minority Oversampling): Instead of class weights, generate synthetic Hybrid and Online training samples using the imbalanced-learn library. This directly expands the minority segment data rather than compensating for its scarcity with loss penalties. Apply it to the training split only, so the test set stays untouched. Possible benefit: better Hybrid precision without sacrificing recall as severely. This can be one of the highest-ROI improvements when minority class performance matters most.
Class weight tuning: Our Hybrid weight of 10.66 (about 27.8× the Store weight) is the formula default, not a business optimum. Try Hybrid weights of 5×, 8×, 15×, and 20×. Map out how precision and recall change, then choose the weight that matches your campaign cost structure. This is a low-effort experiment with direct business impact.

5-fold cross-validation: Our current single train/test split gives one estimate of production performance. With 5-fold CV, we train five separate models on different data splits and average the results, which gives a more reliable measure of real-world generalization. This is common practice before a production deployment decision.

Deeper Exploration

Track validation loss during training: Split 10% of training data into a held-out validation set and plot both training loss and validation loss per epoch. A widening gap between the two is the early signal for overfitting, and tells us when to stop training. This single addition transforms “train blindly for 150 epochs” into “stop near the optimal point.”

Threshold tuning using the ROC curve: The ROC curve from Step 22 gives us the TPR and FPR at every possible threshold for each segment. Use this to find the Hybrid threshold that maximises campaign ROI for your specific cost structure, rather than defaulting to argmax for all deployment scenarios.

Different optimizers: Compare Adam vs. SGD with momentum vs. RMSprop. Performance differences on structured tabular data are often small, but the experiment builds intuition about optimizer behaviour and can occasionally reveal meaningful gains.

The most important experiment principle: Change one thing at a time and compare against the same baseline. If we simultaneously change architecture, learning rate, and batch size, we can’t know which change produced the improvement or which caused a regression. Disciplined, one-variable experimentation is what separates an analyst who gets better results from one who just gets different results.
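For the 5-fold cross-validation idea, the detail that matters with a segment as small as Hybrid is stratification: every fold should keep roughly the same class mix. The sketch below shows the mechanics in plain Python on made-up labels. The `stratified_folds` helper is hypothetical, and in a real project scikit-learn's `StratifiedKFold` is the usual tool for the same job.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, class by class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets its share of this class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy labels in roughly the tutorial's 87 / 10 / 3 class mix (made-up data).
toy_labels = ["Store"] * 870 + ["Online"] * 100 + ["Hybrid"] * 30
folds = stratified_folds(toy_labels, k=5)

for fold_id, test_idx in enumerate(folds):
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    # In a real run: train a fresh model on train_idx, evaluate on test_idx,
    # and collect macro F1 per fold before averaging across the five folds.
    print(fold_id, len(train_idx), dict(Counter(toy_labels[i] for i in test_idx)))
```

Without stratification, a random fold of 200 customers could easily contain only two or three Hybrid cases, which would make per-fold Hybrid recall too noisy to compare.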
From Customer Behavioral Signals to Business Value: A PyTorch ANN Tutorial was originally published in DataDrivenInvestor on Medium.

Source

https://medium.datadriveninvestor.com/from-customer-behavioral-signals-to-business-value-a-pytorch-ann-tutorial-c0115cf49a10?source=rss----32881626c9c9---4