The other day, I had a problem. Wombie was coming soon, but we still didn’t have a name picked out for the baby-to-be! Or, rather, we didn’t have a shortlist of the 3-ish names we liked most (so that we could see which fit him best at the hospital).

Check it out! Read the README.
Here’s the story behind the scenes:
There were some strong contenders (Omri, Amit, Amitai, Alon) that were sort of tied for us. But maybe if we used the power of statistics, we’d find out that one was secretly stronger than the others as expressed by our votes?
This quickly spun out of control. I created a new version for family to vote on. Then another, public version for friends at large.
Meanwhile, I was hard at work creating a leaderboard. And that led to some challenges!
The first few decisions were relatively simple:
- Simple vote counting didn’t work well. If my mom voted for X over Y over and over again, then that should flatten out to one vote for X over Y, right? Easy enough.
- But what if someone voted for name X over Y 3 times, and Y over X once? Does X get 3/4 of a vote? A full vote? (See the sketch after this list.)
- How do we represent the leaderboard? Aren’t there algorithms to figure out winners of pairwise matchups? I hear Elo is good…
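
Here’s a minimal sketch of that flattening, in Python with made-up vote data (this isn’t the app’s actual code, just the idea): each voter’s repeated picks on a pair collapse into one fractional preference, and those preferences get averaged across voters.

```python
from collections import defaultdict

# Hypothetical vote log: one (voter, winner, loser) tuple per matchup.
votes = [
    ("mom", "Omri", "Alon"),
    ("mom", "Omri", "Alon"),
    ("mom", "Alon", "Omri"),
    ("friend", "Amit", "Omri"),
]

def flatten_votes(votes):
    """Collapse each voter's repeated matchups on the same pair into one
    fractional vote: 3 wins vs. 1 loss becomes 0.75 of a vote, not a full vote."""
    per_voter = defaultdict(lambda: defaultdict(int))  # voter -> (winner, loser) -> count
    for voter, winner, loser in votes:
        per_voter[voter][(winner, loser)] += 1

    shares = defaultdict(list)  # unordered pair -> per-voter share for the first name
    for counts in per_voter.values():
        for a, b in {tuple(sorted(pair)) for pair in counts}:
            wins_a = counts.get((a, b), 0)
            wins_b = counts.get((b, a), 0)
            shares[(a, b)].append(wins_a / (wins_a + wins_b))

    # Average the per-voter shares so nobody's enthusiasm counts double.
    return {pair: sum(s) / len(s) for pair, s in shares.items()}

print(flatten_votes(votes))  # {('Alon', 'Omri'): 0.33..., ('Amit', 'Omri'): 1.0}
```
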
But then it got complicated.
First off, user-submitted names really mucked things up:
- People submitted some names that I liked, some names I disliked, and some names that were clearly trolling.
- If I didn’t propagate the names to other voters for consideration, then the point of submission was lost. But then junk names kept polluting the voting.
- I had to invent a coefficient that let user-submitted names propagate, just more slowly than hardcoded names (see the sketch after this list).
- I also had to implement a blocklist for troll names.
- If someone submitted a name and kept voting for it, that name would get a perfect win/loss record until it propagated to other voters.
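
A rough Python sketch of the idea (the weight value, the blocklist shape, and the data structures are my stand-ins, not the app’s real ones):

```python
import random
from dataclasses import dataclass

@dataclass
class Name:
    text: str
    hardcoded: bool  # True for the original shortlist, False for user submissions

BLOCKLIST: set[str] = set()   # names flagged as trolling go here (lowercased)
HARDCODED_WEIGHT = 1.0
USER_SUBMITTED_WEIGHT = 0.3   # assumed coefficient: submissions surface less often

def pick_matchup(names: list[Name]) -> tuple[Name, Name]:
    """Sample two distinct names for a matchup, skipping blocklisted names and
    down-weighting user submissions so they propagate more slowly."""
    pool = [n for n in names if n.text.lower() not in BLOCKLIST]

    def weight(n: Name) -> float:
        return HARDCODED_WEIGHT if n.hardcoded else USER_SUBMITTED_WEIGHT

    first = random.choices(pool, weights=[weight(n) for n in pool], k=1)[0]
    rest = [n for n in pool if n is not first]
    second = random.choices(rest, weights=[weight(n) for n in rest], k=1)[0]
    return first, second
```
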
And it turns out that finding the “true” winner of unordered pairwise matchups, judged by people who each judged a highly variable number of them, is weirdly complex.
Heavy is the head that chooses the crown:
- Bad or trollish user-submitted names kept dominating the rankings. As a backstop, I implemented two filters: filter out names backed by only a single voter, and filter out user-submitted names entirely.
- Turns out Elo rankings care about order, because they model candidate names as players whose skill can change over time. Oops! Out with Elo.
- I thought about my Integrity Institute days and the mighty power of PageRank. If a candidate name was a domain, and losing to another name was a “link” to that name, we could model a bootstrapped way to find network centrality with untrusted actors!
- Some searching turned up Bradley-Terry rankings. Apparently they’re made for unordered pairwise matches? (See the sketch after this list.)
- I tried out Eigenvector Centrality (though, honestly, I don’t quite understand it) as a generalized variant of PageRank.
- And, despite all the fancy stats, I realized that I needed simple win/loss ratios just as a sanity check!
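
For the curious, here’s a textbook-style Python sketch of the iterative Bradley-Terry fit (not the app’s code): strengths start equal and get re-estimated from win counts until they settle.

```python
def bradley_terry(wins, names, iters=200):
    """Fit Bradley-Terry strengths with the classic iterative update.

    wins[(a, b)] = how many times a beat b. The model only sees counts, so the
    order in which matchups happened doesn't matter (unlike Elo). Names with no
    wins collapse toward zero strength; real data may want some smoothing."""
    strength = {n: 1.0 for n in names}
    for _ in range(iters):
        new = {}
        for i in names:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in names:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = total_wins / denom if denom else strength[i]
        total = sum(new.values())
        strength = {n: v / total for n, v in new.items()}
    return sorted(strength.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage:
# bradley_terry({("Omri", "Alon"): 3, ("Alon", "Omri"): 1, ("Amit", "Omri"): 2},
#               ["Omri", "Alon", "Amit"])
```
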
And here’s how I made it, technically:
- Val.town is an amazing platform for focusing on prototyping an app rather than worrying about tooling, deployment, etc. Big fan!
- I used a lot of LLM help! First, ValTown’s in-house “Townie” app. Then Cursor.
- LLMs are kinda dumb. I had to keep rescuing them from mistakes. But fun! Turns out I was semi vibe coding before I knew what it was.
- I used Cursor to help think through different statistical methods, but I was careful about errors in implementation. I actually kept two test suites over the data: one in Python and one in JavaScript. I figured the Python code would be more canonical for the LLM, and more likely to be a true implementation of the concept (and I kept prodding it to make sure that was true). Then I could check the fidelity of the JavaScript (which is the language of ValTown) against the Python suite (see the sketch after this list).
- As I made the app more and more complex (better logging! usernames and user stats!), I had to create a separate admin panel app just to spot-check and edit data, upgrade from v1 of the logging structures to v2, etc.
- Each algorithm showed different winners. The private, family version found different winners than the public one.
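
As a rough illustration of that Python-as-reference check (the script name and invocation are hypothetical, not the project’s real files), the idea was: run both implementations on the same fixture votes and assert the rankings agree.

```python
import json
import subprocess

def test_js_matches_python_reference(fixture_votes, reference_rankings):
    """Feed the same fixture votes to the JS implementation (run here via node;
    compute_rankings.js is a made-up entry point) and compare its output
    against the rankings the Python reference produced."""
    result = subprocess.run(
        ["node", "compute_rankings.js"],
        input=json.dumps(fixture_votes),
        capture_output=True, text=True, check=True,
    )
    js_rankings = json.loads(result.stdout)
    for name, expected_score in reference_rankings.items():
        assert abs(js_rankings[name] - expected_score) < 1e-6, f"drift on {name}"
```
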
As the README puts it:
This is also an exploration and tutorial in the world of ranking and statistics. Specifically —
- With messy or imperfect data, even algorithms meant to account for it give different results
- The power of regularization. Throw out a few rogue actors / outlier data and things become a lot clearer
- Rather than put data into a black box algorithm and call it a day: interrogate the results!
And
Data is important, analyzing it is helpful, and the data sense to interrogate the problem is necessary. But at the end of the day, decisions should be informed by data, not mandated by it.
In the end, Sarah and I spent the first four days in the hospital looking at the little baby, thinking about what he looked like, and also what kind of expectations we wanted to put on him. What name would work well for a child as well as a man? We made our choice based on instinct and reason — but not before I peeked at the leaderboard to make sure that Omri was among the best performers.