Tags: Computer Vision, CNN, Evaluation, Experiment

Snake Detector: evaluation-first CNN experiment

This page is the engineering story behind the repo: a small, noisy classification problem where the main risk is fooling yourself with headline accuracy. The proof is in the workflow and artifacts, not a single leaderboard score.

Problem framing

With limited images, class imbalance, and imperfect labels, the model can look fine on aggregate accuracy while failing on the species that matter for real use. The goal of this project was to keep every training run comparable and to force error analysis before chasing bigger architectures.
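With imbalanced classes, a random split can hand one species a flattering validation set. A minimal, seeded stratified split can be sketched in plain Python (the helper name and seed are illustrative, not the repo's actual code):

```python
# Hypothetical stratified-split helper: group items by label, shuffle each
# group with a fixed seed, and cut the same fraction from every class so
# class ratios survive the split and the split is identical every run.
import random
from collections import defaultdict

def stratified_split(items, labels, val_frac=0.2, seed=42):
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    rng = random.Random(seed)  # fixed seed -> reproducible split
    train, val = [], []
    for label, group in sorted(by_class.items()):
        rng.shuffle(group)
        cut = max(1, round(len(group) * val_frac))
        val.extend((x, label) for x in group[:cut])
        train.extend((x, label) for x in group[cut:])
    return train, val

# 100 items across 4 classes: each class contributes the same fraction to val.
train, val = stratified_split(list(range(100)), [i % 4 for i in range(100)])
```

Libraries like scikit-learn offer the same idea via `train_test_split(..., stratify=labels)`; the hand-rolled version just makes the mechanism explicit.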

Artifacts (what gets saved)

Figure: reproducible evaluation loop (same splits and artifacts every run): Data → Clean / split → Train → Metrics → Confusion review → Next change
Stratified train/val split: keep class ratios stable so metrics reflect generalization, not a lucky split.
Augmentation policy (logged): make image-level changes comparable across runs instead of silently drifting.
Confusion matrix + per-class review: surface which species are confused before touching model depth or width.
Run folder (config + metrics snapshot): reproduce any reported number without guessing which code version produced it.

Outcome signal

Technical outcome: a repeatable loop in which weak classes surface in structured review instead of hiding behind a single accuracy number. The repo is the source of truth for scripts and training flow; plug in your own metrics exports or confusion matrices as you iterate.
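The run-folder snapshot can be as small as two JSON files written side by side. A hedged sketch (paths, keys, and values below are assumptions for illustration, not the repo's actual layout):

```python
# Write the exact config and final metrics into one run directory so any
# reported number can be traced back to the settings that produced it.
import json
from pathlib import Path

def save_run(run_dir, config, metrics):
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # sort_keys makes diffs between run folders stable and reviewable
    (run_dir / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))

# Illustrative run: logged augmentation policy travels with the metrics.
save_run("runs/example",
         {"seed": 42, "augment": "flip+rotate", "lr": 1e-3},
         {"val_acc": 0.87})
```

Because the augmentation policy and seed live in `config.json` next to `metrics.json`, two runs can be compared by diffing folders rather than guessing which code version produced a number.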