The pipeline hit 332 tracked releases last week. I thought that was a milestone worth celebrating until I looked at the dedup stats.
Turns out 23 of those "distinct" entries were the same model release, just named differently across sources. "Llama-3.1-8B-Instruct", "Meta-Llama-3.1-8B-Instruct", and "llama3.1:8b" all referred to the exact same thing. My naive string-matching dedup had been silently failing for months.
The way I found out: I was hand-checking a batch and noticed three entries in the feed that were clearly the same release. Dug into the DB. Found 23 collision clusters. The worst one had 7 variants of the same model across different sources.
The fix wasn't complicated: compare normalized forms instead of raw strings. Slug the model name, strip vendor prefixes, and lowercase everything before comparing. Took about 90 minutes to implement and run a migration.
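The normalization step can be sketched roughly like this. This is a minimal illustration, not the actual pipeline code: the `VENDOR_PREFIXES` list is hypothetical, and the separator-collapsing rule is one plausible choice among several.

```python
import re

# Hypothetical vendor prefixes to strip; the real list would be
# built from the sources the pipeline actually tracks.
VENDOR_PREFIXES = ("meta-", "meta/", "mistralai/", "google/")

def normalize_name(name: str) -> str:
    """Reduce a model name to a comparable slug: lowercase, strip a
    known vendor prefix, then collapse separators into hyphens."""
    slug = name.strip().lower()
    for prefix in VENDOR_PREFIXES:
        if slug.startswith(prefix):
            slug = slug[len(prefix):]
            break
    # Treat '.', ':', '/', '_', '-', and whitespace as one separator,
    # so "Llama-3.1-8B-Instruct" and "Meta-Llama-3.1-8B-Instruct"
    # converge on the same slug.
    return re.sub(r"[\s/:_.\-]+", "-", slug)
```

Note that a scheme like this catches the prefix and casing variants, but tag-style names such as "llama3.1:8b" (no hyphen between "llama" and the version, no "instruct" suffix) would still need source-specific rules on top.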
But here's the part that actually stung: I had been using "332 releases tracked" as a public number. Now it's 309 once you deduplicate properly. That's a 7% correction on the headline metric I'd been citing.
I updated the site. I don't love seeing a smaller number but I'm not going to leave it wrong because it looked better.
This is the unglamorous part of building a data pipeline. Not the clever architecture decisions or the interesting parsing challenges — it's auditing the numbers you've been confidently displaying and realizing they were slightly off.
The fix is in. The count is 309. That's the real number.
Worth writing down because next time I add a new data source, I now have a test: run it against the known collision clusters and confirm dedup is working before the entries hit the feed. The 23-cluster bug was entirely preventable if I'd had that check from the start.
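That regression check is simple enough to sketch. Assuming the collision clusters are kept as a fixture (the shape below is hypothetical, and `normalize` stands in for whatever normalization the pipeline actually uses), the test is just "every known cluster must collapse to one slug":

```python
import re

# Hypothetical fixture: each inner list is a known collision cluster,
# i.e. names that must all map to a single canonical entry.
KNOWN_CLUSTERS = [
    ["Llama-3.1-8B-Instruct", "Meta-Llama-3.1-8B-Instruct"],
]

def normalize(name: str) -> str:
    # Stand-in for the pipeline's normalization: lowercase,
    # strip a vendor prefix, collapse separators.
    slug = name.strip().lower().removeprefix("meta-")
    return re.sub(r"[\s/:_.\-]+", "-", slug)

def dedup_regression_failures(clusters):
    """Return the clusters whose members do NOT collapse to one slug."""
    return [c for c in clusters if len({normalize(n) for n in c}) != 1]
```

Run `dedup_regression_failures(KNOWN_CLUSTERS)` as a gate when onboarding a new source: an empty list means dedup still holds; anything else is a cluster that would have leaked duplicates into the feed.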