I’m seeing 98–99% accuracy on my validation set, but when I test on truly unseen data, performance drops significantly. I suspect some kind of leakage, but I’m not sure where it’s happening.
DavidBegginer
I once trained a model that was performing way too well on the validation set — like, suspiciously good. At first, I was excited… but something felt off. Turned out, it was data leakage.
Here’s what I did to figure it out: I went hunting for places where information about the validation set could have bled into training, starting with whether the two splits shared any rows, since even a small overlap quietly inflates validation scores. A rough version of that check is sketched below.
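A minimal sketch of that overlap check, assuming the data lives in two pandas DataFrames with identical columns (the names `train_df` and `val_df` are placeholders, not from the original post):

```python
# Minimal split-overlap check. Assumes `train_df` and `val_df` are pandas
# DataFrames with the same columns; both names are placeholders.
import pandas as pd

def overlap_fraction(train_df: pd.DataFrame, val_df: pd.DataFrame) -> float:
    """Return the fraction of validation rows that also appear verbatim in training."""
    # Merging on all shared columns keeps only rows present in both frames.
    shared = val_df.merge(train_df.drop_duplicates(), how="inner")
    return len(shared) / len(val_df)

# Anything above 0.0 means the splits share rows, so the validation score is
# partly measuring memorization rather than generalization.
```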
Lesson learned: if your model feels like it’s “too perfect,” always check for leakage first. It’ll save you a ton of headaches later, and it may well solve the problem you’re seeing here.
Absolutely, I’ve been in that spot: 98–99 percent accuracy on validation, feeling confident, then watching performance collapse on truly unseen data. That’s usually a sign of data leakage. What helped me was carefully checking my data splits to make sure the training and validation sets didn’t overlap. I also reviewed my features for anything that might accidentally reveal the target, since a feature can act as a shortcut without you realizing it. In particular, I looked for very high correlations between individual features and the label, because if something is almost perfectly correlated, that’s suspicious. A quick way to run that screen is sketched below.
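Something like this correlation screen, where `df` and the column name "target" are illustrative assumptions rather than details from the question (it also assumes the label is encoded numerically):

```python
# Quick screen for "shortcut" features: correlate every numeric column with
# the label and flag near-perfect values. `df` and "target" are placeholders.
import pandas as pd

def suspicious_features(df: pd.DataFrame, target: str, threshold: float = 0.95) -> pd.Series:
    """Return features whose absolute correlation with the target exceeds `threshold`."""
    corr = df.corr(numeric_only=True)[target].drop(target)  # drop self-correlation
    flagged = corr[corr.abs() >= threshold]
    return flagged.sort_values(key=abs, ascending=False)
```

Note that correlation only catches linear, numeric leaks, so treat an empty result as necessary but not sufficient; categorical or derived leaks still need a manual audit of how each column was produced.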
Finally, I tried a simple model as a sanity check: if even that performed suspiciously well, it was another clue that leakage was happening (a sketch of that probe is below). Fixing these things usually made validation accuracy drop, but the results then matched real-world performance much better, which is what really matters.
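The simple-model probe might look like this with scikit-learn, reusing the placeholder `train_df` from the sketch above (a numeric feature table with a "target" label column):

```python
# The "simple model" probe, sketched with scikit-learn. `train_df` is the same
# placeholder as above: a numeric feature table with a "target" label column.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = train_df.drop(columns=["target"])
y = train_df["target"]

# If even plain logistic regression scores near-perfectly under cross-validation,
# the signal is almost certainly leaking in through a feature, not being learned.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```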