What's the best way to normalize data without leaking info from the test set into the training process?
When I first started working with machine learning, I made the classic mistake: I normalized my entire dataset before splitting it. And guess what? My model performed great... a little too great. 😅
Turns out, I was leaking information from the test set into training without even realizing it.
Here’s what I do now (and always recommend):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean/std from the training data only
X_train_scaled = scaler.transform(X_train)  # standardize train with train statistics
X_test_scaled = scaler.transform(X_test)    # apply the same statistics to test
That way, your model learns only from the training data — just like it would in a real-world setting. No sneak peeks at the test set.
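To make that concrete, here's a minimal end-to-end sketch using synthetic data (the dataset and variable names here are just illustrative): split first, then fit the scaler on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for your real features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# 1. Split BEFORE any scaling.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# 2. Fit the scaler on the training split only.
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Transform both splits with the training statistics.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training split is standardized exactly; the test split only
# approximately, because its mean/std were never shown to the scaler.
print(X_train_scaled.mean(axis=0).round(6))
```

Notice the asymmetry: only `X_train_scaled` is guaranteed to have zero mean and unit variance. That small discrepancy on the test side is exactly what an honest real-world evaluation looks like.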
Trust me, once you catch this, you'll never scale data the old way again. It's a small thing, but it makes a huge difference in keeping your model honest.
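One related habit worth picking up: if you use cross-validation, wrapping the scaler and model in a scikit-learn `Pipeline` re-fits the scaler inside every fold, so the same leak can't sneak back in. A minimal sketch with synthetic data (the estimator choice here is just an example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for a real problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)

# The pipeline refits StandardScaler on each training fold, so the
# held-out fold never influences the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Same principle as above, just automated: the scaler only ever sees the data the model is allowed to learn from.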