```python
import pandas as pd
import numpy as np

df = pd.read_csv("creditcard.csv")

# Always check this first
print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df["Class"].value_counts())
print(f"\nFraud rate: {df['Class'].mean():.4%}")
print(f"\nMissing values: {df.isnull().sum().sum()}")
```
### 2. Baseline Failure Demonstration
Before we build anything, let's prove why accuracy is meaningless here.
```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop("Class", axis=1)
y = df["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A model that predicts the majority class every time
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print(f"Dummy model accuracy: {accuracy_score(y_test, y_pred):.4%}")
```
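To see what that accuracy figure hides, the sketch below repeats the experiment and adds recall. It runs on synthetic labels (an assumption, so it works without `creditcard.csv`); the 0.2% positive rate and the `_demo` names are illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumption): 0 = legitimate, 1 = fraud
rng = np.random.default_rng(42)
n_samples = 50_000
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:100] = 1  # 100 "fraud" rows, i.e. a 0.2% fraud rate
X_demo = rng.normal(size=(n_samples, 5))  # features are irrelevant to the dummy

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

dummy_demo = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_hat = dummy_demo.predict(X_te)

dummy_accuracy = accuracy_score(y_te, y_hat)
dummy_recall = recall_score(y_te, y_hat)  # share of fraud rows that were caught

print(f"Accuracy: {dummy_accuracy:.4%}")
print(f"Recall:   {dummy_recall:.4%}")
```

The dummy never predicts the positive class, so its recall is zero no matter how high the accuracy looks.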
### 3. Evaluation Framework
The correct metrics for fraud detection are:
```python
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
)


def evaluate_model(model, X_test, y_test, model_name):
    # Hard labels at the model's default decision threshold
    y_pred = model.predict(X_test)
    # Predicted probability of the positive (fraud) class
    y_prob = model.predict_proba(X_test)[:, 1]

    print(f"\n{'='*50}")
    print(f"Model: {model_name}")
    print(f"{'='*50}")
    print(f"\nAUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred,
                                target_names=["Legitimate", "Fraud"]))
    print(f"\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
```
Here is what each metric means in a fraud context:
- **AUC-ROC**: measures how well the model separates fraud from legitimate transactions across all thresholds. 1.0 is perfect, 0.5 is random guessing. This is your primary metric.
- **Recall**: of all actual fraud cases, how many did we catch? Missing real fraud is the most costly mistake. Prioritize this.
- **Precision**: of all predicted fraud cases, how many were real? Low precision means too many false alarms blocking legitimate customers.
- **F1 Score**: harmonic mean of precision and recall. A good overall measure when you need to balance both.
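As a worked example, the sketch below computes precision, recall, and F1 by hand from a hypothetical confusion matrix. The counts are made up for illustration and are not results from this dataset.

```python
# Hypothetical confusion-matrix counts (made up for illustration)
tp = 80      # fraud correctly flagged
fn = 20      # fraud missed
fp = 40      # legitimate transactions flagged as fraud
tn = 56_860  # legitimate transactions correctly passed

precision = tp / (tp + fp)  # 80 / 120
recall = tp / (tp + fn)     # 80 / 100
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

With these counts, accuracy stays above 99.8% even though one in five fraud cases is missed and a third of the alerts are false alarms.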
### 4. Preprocessing & Stratification
```python
from sklearn.preprocessing import StandardScaler

X = df.drop("Class", axis=1)
y = df["Class"]

# Stratify ensures both splits maintain
# the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Critical for imbalanced data
)

# Scale features
# Fit only on training data, never on test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set fraud rate: {y_train.mean():.4%}")
print(f"Test set fraud rate: {y_test.mean():.4%}")
```
`stratify=y` keeps the fraud rate nearly identical in both splits. Without it, the rate differs between train and test by chance, and with a small dataset or small folds a split can end up with very few fraud cases, or none at all.
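The effect of stratification can be checked directly. The sketch below uses a small synthetic label vector (an assumption, chosen so the difference is easy to see) and counts the positives that land in the test split with and without `stratify` across several seeds.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 2,000 rows with only 10 positives (0.5%)
y_small = np.zeros(2_000, dtype=int)
y_small[:10] = 1
X_small = np.arange(2_000).reshape(-1, 1)  # placeholder feature

plain_counts, strat_counts = [], []
for seed in range(20):
    # Plain random split: the number of positives in the test set varies by chance
    _, _, _, y_te_plain = train_test_split(
        X_small, y_small, test_size=0.2, random_state=seed
    )
    # Stratified split: the class ratio is preserved
    _, _, _, y_te_strat = train_test_split(
        X_small, y_small, test_size=0.2, random_state=seed, stratify=y_small
    )
    plain_counts.append(int(y_te_plain.sum()))
    strat_counts.append(int(y_te_strat.sum()))

print("Positives in test set, plain split:     ", plain_counts)
print("Positives in test set, stratified split:", strat_counts)
```

The stratified counts stay fixed at 20% of the positives, while the plain counts move around from seed to seed.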
### 5. Approach 1: Class Weights (Logistic Regression)
The simplest approach. Tell the model to penalize misclassifying fraud cases more heavily.
```python
from sklearn.linear_model import LogisticRegression

# Without class weights: baseline
lr_baseline = LogisticRegression(
    random_state=42,
    max_iter=1000
)
lr_baseline.fit(X_train_scaled, y_train)
evaluate_model(lr_baseline, X_test_scaled,
               y_test, "Logistic Regression (No Weights)")

# With class weights: handles imbalance
lr_weighted = LogisticRegression(
    class_weight="balanced",  # This is the key change
    random_state=42,
    max_iter=1000
)
lr_weighted.fit(X_train_scaled, y_train)
evaluate_model(lr_weighted, X_test_scaled,
               y_test, "Logistic Regression (Balanced)")
```
`class_weight="balanced"` automatically sets each class weight inversely proportional to its frequency, using `n_samples / (n_classes * np.bincount(y))`. Fraud cases get a much higher weight, so misclassifying them costs more.
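The balanced weights can be reproduced by hand. This sketch uses scikit-learn's `compute_class_weight` helper on a synthetic label vector (an assumption; the counts are illustrative) and compares the result with the formula from the scikit-learn documentation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumption): 9,980 legitimate rows and 20 fraud rows
y_example = np.array([0] * 9_980 + [1] * 20)

# scikit-learn's "balanced" heuristic
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_example
)

# The same values by hand: n_samples / (n_classes * count per class)
manual = len(y_example) / (2 * np.bincount(y_example))

print("compute_class_weight:", weights)
print("manual formula:      ", manual)
```

With these counts, a misclassified fraud row weighs roughly 500 times as much as a misclassified legitimate row.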
### 6. Approach 2: Random Forest with Class Weights
Tree-based models can capture nonlinear patterns that linear models miss, and they support class weighting too.
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)
rf.fit(X_train_scaled, y_train)
evaluate_model(rf, X_test_scaled,
               y_test, "Random Forest (Balanced)")
```
Random Forest often outperforms Logistic Regression on fraud detection because fraud patterns tend to be nonlinear, but confirm this on your own data rather than assuming it.
### 7. Approach 3: SMOTE Oversampling
SMOTE (Synthetic Minority Over-sampling Technique) balances the dataset by creating synthetic fraud samples, interpolating between existing fraud cases and their nearest fraud-class neighbors.
```python
# Install: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(
    X_train_scaled, y_train
)

print(f"Before SMOTE: {y_train.value_counts().to_dict()}")
print(f"After SMOTE: {pd.Series(y_train_resampled).value_counts().to_dict()}")

rf_smote = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
rf_smote.fit(X_train_resampled, y_train_resampled)
evaluate_model(rf_smote, X_test_scaled,
               y_test, "Random Forest + SMOTE")
```
Important: apply SMOTE only to training data, never to test data. You want to evaluate on the real distribution, not on synthetic data.
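For intuition about what a synthetic sample is, here is a simplified NumPy sketch of the interpolation step at the core of SMOTE. It is not imbalanced-learn's implementation; the helper `smote_like_samples`, its parameters, and the toy points are hypothetical.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation (hypothetical helper, for illustration).

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, n, size=n_new)  # random base samples
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)

    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])


# Toy minority class (assumption): 6 points in 2 dimensions
X_fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_synth = smote_like_samples(X_fraud, n_new=4, k=3)
print(X_synth)
```

Because every synthetic row is built from real rows, generating them before the split would blend test rows into the training set, which is why resampling belongs on the training split only.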
### 8. Threshold Tuning
By default, scikit-learn's `predict()` flags a transaction as fraud when its predicted probability exceeds 0.5. That cutoff is rarely optimal for imbalanced problems.
import numpy as np
from
(Note: Original source truncates here. In practice, threshold optimization involves iterating over probability cutoffs [0.1, 0.2, ..., 0.9] and selecting the value that maximizes F1-Score or business-specific cost functions.)
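Since the original code is cut off, the following is only a sketch of the sweep the note describes, not the author's implementation. It assumes synthetic data from `make_classification`, a held-out validation split, and a weighted logistic regression; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption), roughly 2% positives
X_syn, y_syn = make_classification(
    n_samples=20_000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=42,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
val_prob = clf.predict_proba(X_val)[:, 1]

# Sweep the cutoffs 0.1, 0.2, ..., 0.9 and score each one
thresholds = np.linspace(0.1, 0.9, 9)
results = []
for t in thresholds:
    y_hat = (val_prob >= t).astype(int)
    results.append((
        float(t),
        precision_score(y_val, y_hat, zero_division=0),
        recall_score(y_val, y_hat, zero_division=0),
        f1_score(y_val, y_hat, zero_division=0),
    ))

for t, p, r, f in results:
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}  f1={f:.3f}")

# Keep the cutoff with the highest F1 on the validation split
best_threshold, _, _, best_f1 = max(results, key=lambda row: row[3])
print(f"\nBest F1 = {best_f1:.3f} at threshold {best_threshold:.1f}")
```

The cutoff is chosen on a validation split so that the test set stays untouched for the final evaluation. To optimize a business cost instead of F1, replace the F1 column with a cost computed from the false positive and false negative counts.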
### Pitfall Guide
- Accuracy Trap: On a dataset with a 0.17% fraud rate, optimizing for accuracy rewards a model that predicts the majority class exclusively. Evaluate with AUC-ROC, Recall, and Precision instead.
- Data Leakage via Scaling/SMOTE: Fitting `StandardScaler` or `SMOTE` on the full dataset before splitting leaks test-set statistics into training. Call `fit_transform` (or `fit_resample` for SMOTE) on the training data only, then apply `transform` to the test data.
- Ignoring Stratified Splits: Plain random splits can leave validation folds with very few minority samples, or none at all. `stratify=y` preserves the class ratio across splits.
- Default 0.5 Decision Boundary: Scikit-learn's `predict()` uses a 0.5 cutoff. In fraud detection, lowering the threshold (e.g., to 0.2 or 0.3) raises Recall at the cost of Precision. Tune thresholds using `predict_proba()` and business cost matrices.
- SMOTE on Test Data: Generating synthetic samples for evaluation inflates metrics artificially. SMOTE must only be applied inside cross-validation folds or strictly on the training split.
- Precision-Recall Misalignment: Maximizing Recall without bounding Precision causes alert fatigue. Implement a minimum Precision floor (e.g., ≥0.70) before optimizing Recall further.
- Ignoring Feature Leakage: Variables like `Time` and `Amount` can contain future information or require domain-specific transformation. Validate feature causality before modeling.
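The precision floor from the list above can be expressed with `precision_recall_curve`. This is a minimal sketch on synthetic validation scores (an assumption; the beta distributions just make fraud scores skew higher), with the 0.70 floor taken from the text as an example value.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic validation labels and scores (assumption)
rng = np.random.default_rng(0)
y_val = np.concatenate([np.zeros(5_000, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.beta(1, 12, size=5_000), rng.beta(6, 2, size=100)])

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# precision and recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the three arrays line up.
precision, recall = precision[:-1], recall[:-1]

min_precision = 0.70  # example floor; set it from business costs
eligible = precision >= min_precision

if eligible.any():
    # Among thresholds that meet the floor, keep the one with the highest recall
    best = int(np.argmax(np.where(eligible, recall, -1.0)))
    chosen_threshold = float(thresholds[best])
    print(f"threshold={chosen_threshold:.3f}  "
          f"precision={precision[best]:.3f}  recall={recall[best]:.3f}")
else:
    chosen_threshold = None
    print("No threshold reaches the precision floor")
```

If no threshold meets the floor, that is a sign the model itself needs work before any cutoff is chosen.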
### Deliverables
- Fraud Detection Pipeline Blueprint: End-to-end architecture diagram covering data ingestion, stratified splitting, leakage-safe preprocessing, ensemble modeling, threshold optimization, and monitoring drift.
- Production Readiness Checklist: 24-point validation covering metric alignment, class weight configuration, SMOTE isolation, threshold tuning protocols, and False Positive/Negative cost mapping.
- Configuration Templates: Ready-to-use YAML/JSON configs for `class_weight` tuning, SMOTE hyperparameters (`k_neighbors`, `sampling_strategy`), and dynamic threshold optimization scripts compatible with scikit-learn pipelines.