After traveling through 16 states and roughly twice that number of cities in 2013, I wanted a way to measure the similarities between the different cities and determine which city I would fit in best. I had some mental impressions, but the limited amount of time in each city prevents any accurate long-term statements about any particular place. While Facebook or Twitter would probably have provided deeper datasets, that data would require a detailed model of natural language parsing and fake-like rejection to beat the noise in each measurement. So rather than dealing with those hard problems, I decided to dive into the OkCupid dataset to try to get some zeroth-order results on each question.
OkCupid gives you the match, friend, and enemy percentages for other members relative to you. These percentages are generated by correlating a large number of questions that you can answer with the importance rating you give each question (a rough sketch of the published formula follows the two questions below). For this project I will first focus on the match percentage of females 24-32 as my measure of similarity for each city. This measure was originally chosen to match the census data, but I would like to go back and extend it to include males 24-34, and a larger age range for both. To be clear, this will be focused on the following two questions:
- What US cities match me the best?
- Are there significant differences between the populations in different cities, or is the dominant variable just population size?
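As an aside, the match percentage itself is, per OkCupid's own public description, roughly the geometric mean of how well each person's answers satisfy the other's importance-weighted preferences. The sketch below only illustrates that published idea; the question and answer data structures are hypothetical, the point values are the ones OkCupid has quoted but should be treated as assumptions, and the margin-of-error correction applied for small numbers of answered questions is omitted.

# Rough illustration of the published match formula -- NOT OkCupid's actual code.
# The data structures (question dicts, answer lookups) are hypothetical.
IMPORTANCE_POINTS = {'irrelevant': 0, 'a little important': 1,
                     'somewhat important': 10, 'very important': 50,
                     'mandatory': 250}

def satisfaction(my_questions, their_answers):
    # Fraction of possible importance points the other person earns on my questions.
    earned, possible = 0.0, 0.0
    for q in my_questions:
        pts = IMPORTANCE_POINTS[q['importance']]
        possible += pts
        if their_answers.get(q['id']) in q['acceptable']:
            earned += pts
    return earned / possible if possible else 0.0

def match_percent(questions_a, answers_a, questions_b, answers_b):
    # Geometric mean of the two satisfactions (margin-of-error term omitted).
    return 100.0 * (satisfaction(questions_a, answers_b) *
                    satisfaction(questions_b, answers_a)) ** 0.5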
For some reason I feel I should mention that this should not be taken to mean that I should only live in city X because it matches me, nor do I think one should use this technique to find potential matches. Honestly, by publishing this I am probably hurting my match-finding possibilities. This was merely an exploration of Python and of what trends I could see in OkCupid data.
To keep things somewhat clean and fast, let me load a bunch of data that I will use below. (This post was generated by an IPython notebook – more on that later.) Most of the heavy lifting is done using requests and json. I would like to rewrite most of the data storage using pandas, but the hacking-to-beer ratio did not lend itself to more than quick scripts. The code for this project will be open; it just needs me to clean up one of my libraries.
import numpy as np
import pylab

from pprint import pprint
from pycupid import visualize, api, cluster
from pysurvey.plot import setup, line, legend

# Load the 'random' sample of profiles and their geographic locations
ax = visualize.setup_plot()
people = api.loadRandom('random2')
lats, lons = visualize.getLatLon(people, update=False)
points = np.array(zip(lons, lats))

# County shapes, census populations, and map polygons for plotting
shapes = visualize.getShapes()
pops = visualize.loadPopulation()
llpolys = visualize.getPolys()
polys = visualize.getPolys(ax.m)

# Empty plot so hide it
pylab.close()
Originally the plan was to grab census data to normalize out the population differences for each city. This led me to learn about the horrible FIPS standard, and to the excellent census and us packages. In the end, I decided against using this great dataset, to limit introducing possible selection biases between the OkCupid population and the highly complete census data. I mainly realized that OkCupid depends heavily on the amount of advertising it might be doing in each market segment, and on the non-linear "friend" effect (a friend gets you to join since they are on it).
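For completeness, here is roughly how those two packages fit together to pull county populations keyed by FIPS code. This is a minimal sketch, not what visualize.loadPopulation() actually does: the API key is a placeholder and B01001_026E (total female population from the ACS 5-year tables) is my guess at a reasonable variable.

# Minimal sketch: county-level female population keyed by 5-digit FIPS,
# using the census and us packages. Placeholder API key; the ACS variable
# choice is an assumption, not the one used elsewhere in this post.
import us
from census import Census

c = Census('MY_CENSUS_API_KEY')
female_by_fips = {}
for state in us.states.STATES:
    rows = c.acs5.get(('NAME', 'B01001_026E'),
                      {'for': 'county:*', 'in': 'state:{0}'.format(state.fips)})
    for row in rows:
        female_by_fips[row['state'] + row['county']] = float(row['B01001_026E'])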
Nevertheless, I will do some quick comparisons between the census data and the OkCupid data.
def plotFemale():
    # Log10 of the census female population per county (keyed by FIPS)
    out = {key: np.log10(pops[key]['female']) for key in pops}
    visualize.plotpoly(polys, out,
                       cmap=pylab.cm.YlOrRd, clim=[0, 5.5],
                       clabel='Females Age$\in[25,34]$',
                       cticks=[1, 3, 5],
                       cticknames=['10', '1,000', '100,000'])
    return out
nCensus = plotFemale()
Above you may have noticed that I loaded the random dataset. While I call it random, I really doubt that it is a pseudorandom subsample of the OkCupid population. I think it is dominated by a mixture of recent logins and recent signups. However, since I will be taking the ratio of the high matches to the entire sample, it should be a fair comparison.
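To make that concrete, the per-county measure used for the rest of this post boils down to the fraction of sampled profiles above a high-match cutoff. A minimal standalone version (the 90% cutoff itself is motivated a bit further down):

import numpy as np

def high_match_fraction(matches, cutoff=90):
    # Percent of sampled profiles in a county whose match percentage
    # exceeds the cutoff; this ratio is what gets mapped per county below.
    matches = np.asarray(matches)
    return 100.0 * np.sum(matches > cutoff) / float(len(matches))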
Note to future me. I should go to PA and have coffee with Ms. 99%.
print 'Cities with the five largest samples:'
for p in toploc(people, n=5, quiet=True):
    print ' {0[0]:20s} {0[1]}'.format(p)
print
print 'Cities with the five highest matches:'
for p in top(people, 'match', n=5, quiet=True):
    print ' {:20s} {}'.format(p['location'], p['match'])
Comparing this with the population data using the FIPS keys, we can clearly see the differences between the sampled population and the census population estimate. This is ordered by the estimated census population.
print ' fips Census OkC Main City'
for key, logPop in top(nCensus, n=10, quiet=True):
    ii = visualize.whereShape(points, shapes[key]['shape'])
    print '{0:5s} {1:,} {2[0][1]:3d} {2[0][0]:20s} '.format(key,
            int(10**logPop),
            toploc([people[i] for i in ii], n=2, quiet=True))
So it might not be the best estimator of the underlying population, but let's just see how things are scattered.
def plotMatches():
    ax = visualize.setup_plot()
    x, y = ax.m(lons, lats)
    pylab.plot(x, y, '.', alpha=0.25, label='Meat Popsicle')
    legend(loc=1)
plotMatches()
The number-per-county map generally matches the census data well, but with a large amount of scatter for low-density counties.
def plotOkc():
    out = {}
    for key in shapes:
        ii = visualize.whereShape(points, shapes[key]['shape'])
        if len(ii) > 0:
            out[key] = np.log10(len(ii))
    visualize.plotpoly(polys, out,
                       clim=[-0.1, 2.1],
                       clabel='Sample of OkCupid',
                       cticks=[0, 1, 2],
                       cticknames=['1', '10', '100'])
    return out
nMatch = plotOkc()
It is clear from the above figure that there are a large number of really bright counties that do not have very many sampled points. To keep these counties from dominating the comparison, I will require that there are at least ten samples, and look at the fraction of people above a 90% match. Using any match percentage limit above 75-80% gives roughly similar results.
I do not know what is going on with the population spikes at 10% and 20%. I presume this is the OkCupid servers adding some enemy “spice” to the results.
def plotMatchDist():
    x = [p['match'] for p in people]
    setup(figsize=(14,6), subplt=(1,2,1),
          xr=[0,100], xlabel='Match Percentage', ylabel='Number')
    xbin = np.arange(0,100,2)
    pylab.hist(x, xbin, alpha=0.7, label='OkC Random Sample')
    mx = 90
    line(x=mx, label='High Match [{:0.0f}%]'.format(mx))
    legend(loc=2)

    setup(subplt=(1,2,2),
          xr=[1,300], xlog=True, xlabel='Number/County',
          yr=[1,1000], ylog=True)
    pylab.hist([10.0**x for x in nMatch.values()],
               np.logspace(0,2.3,10), alpha=0.7,
               label='Number of People/County')
    legend(loc=1)
    line(x=10)
plotMatchDist()
One interesting thing we can find in the data is the roughly linear relation between the number of sampled individuals in each county and the census population estimates. This suggests that there is not a dominant non-linear "friend" effect in the population (you join OkCupid because your friends are on it).
def plotSize():
    x = [10.0**nCensus[k] for k in nMatch]
    y = [10.0**nMatch[k] for k in nMatch]
    setup(figsize=(8,8),
          xr=[1e3,7.9e5], xlog=True, xlabel='Census',
          yr=[1.0,700], ylog=True, ylabel='OkCupid',
          )
    pylab.plot(x, y, 's', label='County Samples / Census People')
    pylab.plot([1e3,1e6], [1,1e3], 'r', lw=3,
               label='One Sample / Census person')
    legend(loc=2)
plotSize()
def plotFraction(matchlim=90):
    matches = np.array([p['match'] for p in people])
    out, extra = {}, {}
    for key in shapes:
        ii = visualize.whereShape(points, shapes[key]['shape'])
        if len(ii) > 10:
            jj = np.where(matches[ii] > matchlim)[0]
            out[key] = len(jj)/float(len(ii))*100.0
            cities = toploc([people[i] for i in ii],
                            n=3, quiet=True)
            extra[key] = dict(percent=out[key],
                              match=matches[ii],
                              number=len(ii),
                              numberhigh=len(jj),
                              cities=cities)
    visualize.plotpoly(polys, out,
                       cmap=visualize.reds, clim=[1,14],
                       clabel='Percent Match 90%+')
    return extra
nFraction = plotFraction()
tmp = [nf for nf in nFraction.values() if nf['number'] > 40]
print 'Num Percent Three Largest Cities'
for t in top(tmp, 'percent', n=20, quiet=True):
    cities = ', '.join([c[0] for c in t['cities']])
    print u'{0:3.0f} {1:8.1f} {2}'.format(t['number'],
                                          t['percent'],
                                          cities)
Clustering
Rather than being stuck with county lines, let's switch to clustering. Here I use scikit-learn's MeanShift to group a number of different cities together by geographical location. It might also be interesting to add the match percentages into the algorithm as a better estimator of the different populations in each geographical region.
For now I am sticking to pretty large groups (quantile=0.01). Decreasing this number divides the regions up into smaller, but generally more interesting, groups that I will investigate later. I had hoped that contouring the regions would give mega-regions of high interest, but the high spatial frequency of cities is not captured well by this visualization technique.
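Since cluster.plot() below consumes peo, X, labels, and centers that were computed off-screen, here is a minimal sketch of what that clustering step presumably looks like with scikit-learn's MeanShift; the exact feature matrix and any filtering of people into peo are assumptions.

# Sketch of the MeanShift step (assumed, since the actual call is off-screen).
from sklearn.cluster import MeanShift, estimate_bandwidth

X = points                      # (lon, lat) pairs loaded at the top of the post
bandwidth = estimate_bandwidth(X, quantile=0.01)  # large quantile => large regions
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_             # cluster id for each sampled person
centers = ms.cluster_centers_   # geographic center of each cluster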
reload(cluster)
def plotClustermap():
    visualize.setup_plot()
    nC = cluster.plot(peo, X, labels, centers, nlimit=40,
                      addhist=False, addcity=True,
                      label='Fraction of High Match (90%+)',
                      cmap=pylab.cm.YlOrRd, crange=[0,12])
    return nC
nCluster = plotClustermap()
print ' Percent Num Top Three Largest Metropolitan Areas'
for n in nCluster:
    if n[0] > 7:
        print(u'{0[0]:8.2f} {0[1]:4d} {0[2]}'.format(n))
Differences in Cities
Using the match percentage distributions, we can see how different each city's population is from the others.
dist = cluster.plotdist(peo, labels,
                        cmap=pylab.cm.YlOrRd,
                        crange=[0,12])
Calculate the two-sample KS statistic to quantify how likely it is that the differences between the distributions are real.
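The pairwise comparison in compare_cities() presumably reduces to SciPy's two-sample KS test; a minimal sketch, assuming dist maps each cluster to its array of match percentages:

# Sketch of a pairwise two-sample KS comparison between clusters.
from scipy.stats import ks_2samp

def ks_pvalues(dist):
    keys = sorted(dist)
    pvals = {}
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            stat, p = ks_2samp(dist[a], dist[b])
            # Small p-value: unlikely the two match distributions were
            # drawn from the same underlying population.
            pvals[(a, b)] = p
    return pvals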
cluster.compare_cities(dist,
                       label='KS Probability\n' +
                             '(Dark == likely difference in populations)')