I’ve been thinking about data visualization tools lately. In particular, I got some advice to check out Google’s Fusion Tables. I needed some data to start playing around with and, luckily, I happened to have 29,035 OKCupid profiles lying around in a database (learn how I got them here).
First question: what does the age distribution on OKCupid look like?
One thing that jumps out at me is the shape of these various curves. Data will often go “Nudge, nudge, wink, wink, I might be a gamma distribution.” In response to this nudge and wink we can start to think about what might cause this shape in the data to appear. There is a wealth of theory on the gamma distribution as it comes up frequently in all sorts of branches of science.
One place it comes up particularly often is in models of wait times: waiting for death, waiting for your next car accident, waiting for the devil to pick up the phone (although waiting for the devil is actually a Poisson distribution; I’m not making this up, BTW). What we see here is the distribution of waiting times until a *terminal* relationship (the one that removes you from the dating pool, with or without appropriate ‘Fatal Attraction’ references).
Caveat for the more mathematically inclined: yes, I know that a more complicated model of entering and exiting relationships, one in which we track much more detailed information, would probably reveal that the distribution deviates from the gamma distribution in some way. But to a first-order approximation, this is a good explanation of the data.
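For anyone who wants to poke at the gamma idea themselves, here is a minimal sketch of a method-of-moments fit. It is not how the charts in this post were made. The function name `gamma_moments` is made up for illustration, and the data is synthetic (random draws standing in for "years past 18"), not the real profile ages.

```python
import random
import statistics


def gamma_moments(samples):
    """Method-of-moments estimate of a gamma distribution's shape and scale.

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so both parameters can be
    recovered from the sample mean and variance.
    """
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples)
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


# Synthetic stand-in data (an assumption, NOT the OKCupid ages):
# 20,000 draws from a gamma with shape 2 and scale 5.
rng = random.Random(0)
years_past_18 = [rng.gammavariate(2.0, 5.0) for _ in range(20_000)]

shape, scale = gamma_moments(years_past_18)
print(f"estimated shape: {shape:.2f}, estimated scale: {scale:.2f}")
```

With real ages you would subtract the minimum age first, since the gamma distribution starts at zero. If the fitted curve sits badly on the histogram, that is a hint the simple waiting-time story is missing something.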
Out of the 29,035 profiles, I had 18,429 males and 10,606 females. That gives an overall male-to-female ratio of about 1.737. I plotted the geographical distribution of this ratio by state and struggled for a while with the best way to display it. Then, while fooling around with the table filters, I set the table to show only the states with a ratio less than 1.737.
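The same filter is easy to reproduce outside of Fusion Tables. The sketch below uses the real overall counts from above, but the per-state counts and the helper name `ratios_below` are placeholders invented for illustration, not values from my database.

```python
# Overall counts reported in the post.
MALES = 18_429
FEMALES = 10_606
overall_ratio = MALES / FEMALES

# Hypothetical per-state (men, women) counts, for illustration only.
state_counts = {
    "State A": (120, 100),
    "State B": (300, 150),
    "State C": (90, 60),
}


def ratios_below(counts, threshold):
    """Return {state: male-to-female ratio} for states under the threshold."""
    result = {}
    for state, (men, women) in counts.items():
        if women == 0:
            continue  # ratio undefined with no women in the sample
        ratio = men / women
        if ratio < threshold:
            result[state] = round(ratio, 3)
    return result


below_average = ratios_below(state_counts, overall_ratio)
print(below_average)
```

Using the overall ratio as the cutoff splits the map into "more women than the site average" and "fewer," which is what made the geographic pattern pop out.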
My interpretation? West of the Mississippi, thar’ be sausages.
Locales with the fewest men to women on OKC: Washington DC (0.862), followed by New Hampshire (1.145).
Neil Gaiman fan? Bible fan:
My original concept for this blog was to look at the geographical distribution of OKCupid profiles with mentions of Neil Gaiman or one of his books. But there were simply not enough profiles to give me enough statistical power to understand the distribution across states. Gaiman or one of his books is mentioned in a little more than 1% of profiles that I’ve downloaded. The distribution of readers across the country can’t be generated (at least without enormous error bars) with only about 300 data points.
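To put a rough number on "enormous error bars," here is a small sketch using the Wilson score interval for a proportion. The per-state figures are assumptions for illustration: if roughly 29,000 profiles were spread evenly over about 51 locales, each would have around 570 profiles and, at a 1% mention rate, about 6 mentions. The function name `wilson_interval` is my own.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Assumed numbers for one "typical" state: 6 mentions out of 570 profiles.
low, high = wilson_interval(6, 570)
print(f"observed rate: {6 / 570:.3%}")
print(f"plausible range: {low:.3%} to {high:.3%}")
```

With counts this small, the upper end of the interval is several times the lower end, so any state-to-state differences on a map would mostly be noise.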
I still wanted to look at the distribution of something drawn from the OKCupid essay data, so I turned to something that was more robustly reported: the Bible, or mentions of God or Jesus.
To the surprise of absolutely no one, the Deep South has the greatest use of religiously oriented vocabulary across the US.
I’ve had a great deal of fun with Fusion Tables. They’ll definitely remain in my data visualization arsenal. They allow for quick generation of visuals, especially with geographic data, that would otherwise take a long time to produce. I’d encourage you to go and play around with one of the various tutorials they have. I really like how quickly the platform let me extract some visual insights from the OKCupid data.
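Counting mentions like this can be sketched with a regular expression. This is an illustration rather than the exact query I ran: the term list, the `(state, essay)` record layout, and the function name `mention_rate_by_state` are all assumptions, and the sample essays are made up.

```python
import re
from collections import defaultdict

# Whole-word, case-insensitive match, so "Godzilla" does not count as "god".
RELIGIOUS_TERMS = re.compile(r"\b(bible|god|jesus)\b", re.IGNORECASE)


def mention_rate_by_state(profiles):
    """Fraction of profiles per state whose essay mentions any of the terms.

    `profiles` is an iterable of (state, essay_text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for state, essay in profiles:
        totals[state] += 1
        if RELIGIOUS_TERMS.search(essay):
            hits[state] += 1
    return {state: hits[state] / totals[state] for state in totals}


# Made-up sample records, for illustration only.
sample = [
    ("AL", "I love God and my family."),
    ("AL", "Big fan of Godzilla movies."),
    ("WA", "Coffee, hiking, and Neil Gaiman."),
]
rates = mention_rate_by_state(sample)
print(rates)
```

Counting each profile at most once, rather than counting every occurrence, keeps one very enthusiastic essay from skewing a small state.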
If you have an idea for a dating map you’d like to see, let me know in the comments. I’ll continue to play with the data I have and see what pops up.
I presented some observations about gender and geographical differences. Some of them make intuitive sense, some of them are amusing, and all of them rise to the statistical level of “hypothesis generating.” That is to say, due to several vagaries in the collection of the profiles, there is not sufficient statistical power to be definitive. That said, I stand by my observation: “West of the Mississippi, thar’ be sausages.”