After traveling through 16 states and roughly twice that number of cities in 2013, I wanted a way to measure the similarities between the different cities and determine which city I would fit in best. I had some mental impressions, but the limited amount of time in each city prevents any accurate long-term statements about any particular place. While Facebook or Twitter would probably have provided deeper datasets, that data would require a detailed model of natural language parsing and fake-like rejection to beat the noise in each measurement. So rather than dealing with those hard problems, I decided to dive into the OkCupid dataset to try to get some zeroth-order results on each question.
OkCupid gives you the match, friend, and enemy percentages for other members relative to you. These percentages are generated by correlating a large number of questions that you can answer with the importance rating you give each question (a rough sketch of the published formula follows the two questions below). For this project I will first focus on the match percentage of females 24-32 as my measure of similarity for each city. This measure was originally chosen to match the census data, but I would like to go back and extend it to include males 24-34, and a larger age range for both. To be clear, this will be focused on the following two questions:
- What US cities match me the best?
- Are there significant differences between the populations in different cities, or is the dominant variable just population size?
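As an aside, the match percentage itself is, per OkCupid's own public description, roughly the geometric mean of how well each person's answers satisfy the other's importance-weighted preferences. The sketch below only illustrates that published idea; the question and answer data structures are hypothetical, the point values are the ones OkCupid has quoted but should be treated as assumptions, and the margin-of-error correction applied for small numbers of answered questions is omitted.

# Rough illustration of the published match formula -- NOT OkCupid's actual code.
# The data structures (question dicts, answer lookups) are hypothetical.
IMPORTANCE_POINTS = {'irrelevant': 0, 'a little important': 1,
                     'somewhat important': 10, 'very important': 50,
                     'mandatory': 250}

def satisfaction(my_questions, their_answers):
    # Fraction of possible importance points the other person earns on my questions.
    earned, possible = 0.0, 0.0
    for q in my_questions:
        pts = IMPORTANCE_POINTS[q['importance']]
        possible += pts
        if their_answers.get(q['id']) in q['acceptable']:
            earned += pts
    return earned / possible if possible else 0.0

def match_percent(questions_a, answers_a, questions_b, answers_b):
    # Geometric mean of the two satisfactions (margin-of-error term omitted).
    return 100.0 * (satisfaction(questions_a, answers_b) *
                    satisfaction(questions_b, answers_a)) ** 0.5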
For some reason I feel I should mention that this should not be taken to mean that I should only live in city X because it matches me, nor do I think one should use this technique to find potential matches. Honestly, by publishing this I am probably hurting my match-finding possibilities. This was merely an exploration of Python and of what trends I could see in OkCupid data.
To keep things somewhat clean and fast, let me load a bunch of data that I will use below. (This post was generated by an IPython notebook – more on that later.) Most of the heavy lifting is done using requests and json. I would like to rewrite most of the data storage using pandas, but the hacking-to-beer ratio did not lend itself to more than quick scripts. The code for this project will be open; it just needs me to clean up one of my libraries.
import numpy as np
import pylab

from pprint import pprint
from pycupid import visualize, api, cluster
from pysurvey.plot import setup, line, legend

# Load the 'random' sample of profiles and their geographic locations
ax = visualize.setup_plot()
people = api.loadRandom('random2')
lats, lons = visualize.getLatLon(people, update=False)
points = np.array(zip(lons, lats))

# County shapes, census populations, and map polygons for plotting
shapes = visualize.getShapes()
pops = visualize.loadPopulation()
llpolys = visualize.getPolys()
polys = visualize.getPolys(ax.m)

# Empty plot so hide it
pylab.close()
Originally the plan was to grab census data to normalize out the population differences for each city. This led me to learn about the horrible FIPS standard, and to the excellent census and us packages. In the end, I decided against using this great dataset, to limit introducing possible selection biases between the OkCupid population and the highly complete census data. I mainly realized that OkCupid depends heavily on the amount of advertising it might be doing in each market segment, and on the non-linear "friend" effect (a friend gets you to join since they are on it).
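For completeness, here is roughly how those two packages fit together to pull county populations keyed by FIPS code. This is a minimal sketch, not what visualize.loadPopulation() actually does: the API key is a placeholder and B01001_026E (total female population from the ACS 5-year tables) is my guess at a reasonable variable.

# Minimal sketch: county-level female population keyed by 5-digit FIPS,
# using the census and us packages. Placeholder API key; the ACS variable
# choice is an assumption, not the one used elsewhere in this post.
import us
from census import Census

c = Census('MY_CENSUS_API_KEY')
female_by_fips = {}
for state in us.states.STATES:
    rows = c.acs5.get(('NAME', 'B01001_026E'),
                      {'for': 'county:*', 'in': 'state:{0}'.format(state.fips)})
    for row in rows:
        female_by_fips[row['state'] + row['county']] = float(row['B01001_026E'])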
Nevertheless, I will do some quick comparisons between the census data and the OkCupid data.
def plotFemale():
    # Log10 of the census female population per county (keyed by FIPS)
    out = {key: np.log10(pops[key]['female']) for key in pops}
    visualize.plotpoly(polys, out,
                       cmap=pylab.cm.YlOrRd, clim=[0, 5.5],
                       clabel='Females Age$\in[25,34]$',
                       cticks=[1, 3, 5],
                       cticknames=['10', '1,000', '100,000'])
    return out
nCensus = plotFemale()
Above you may have noticed that I loaded the random dataset. While I call it random, I really doubt that it is a pseudorandom subsample of the OkCupid population. I think it is dominated by a mixture of recent logins and recent signups. However, since I will be taking the ratio of the high matches to the entire sample, it should be a fair comparison.
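To make that concrete, the per-county measure used for the rest of this post boils down to the fraction of sampled profiles above a high-match cutoff. A minimal standalone version (the 90% cutoff itself is motivated a bit further down):

import numpy as np

def high_match_fraction(matches, cutoff=90):
    # Percent of sampled profiles in a county whose match percentage
    # exceeds the cutoff; this ratio is what gets mapped per county below.
    matches = np.asarray(matches)
    return 100.0 * np.sum(matches > cutoff) / float(len(matches))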
Note to future me. I should go to PA and have coffee with Ms. 99%.
print 'Cities with the five largest samples:'
for p in toploc(people, n=5, quiet=True):
    print ' {0[0]:20s} {0[1]}'.format(p)
print
print 'Cities with the five highest matches:'
for p in top(people, 'match', n=5, quiet=True):
    print ' {:20s} {}'.format(p['location'], p['match'])
Comparing this with the population data using the FIPS keys, we can clearly see the differences between the sampled population and the census population estimate. This is ordered by the estimated census population.
print ' fips Census OkC Main City'
for key, logPop in top(nCensus, n=10, quiet=True):
    ii = visualize.whereShape(points, shapes[key]['shape'])
    print '{0:5s} {1:,} {2[0][1]:3d} {2[0][0]:20s} '.format(key,
            int(10**logPop),
            toploc([people[i] for i in ii], n=2, quiet=True))
So it might not be the best estimator of the underlying population, but let's just see how things are scattered.
def plotMatches():
    ax = visualize.setup_plot()
    x, y = ax.m(lons, lats)
    pylab.plot(x, y, '.', alpha=0.25, label='Meat Popsicle')
    legend(loc=1)
plotMatches()
The number-per-county map generally matches the census data well, but with a large amount of scatter for low-density counties.
def plotOkc():
    out = {}
    for key in shapes:
        ii = visualize.whereShape(points, shapes[key]['shape'])
        if len(ii) > 0:
            out[key] = np.log10(len(ii))
    visualize.plotpoly(polys, out,
                       clim=[-0.1, 2.1],
                       clabel='Sample of OkCupid',
                       cticks=[0, 1, 2],
                       cticknames=['1', '10', '100'])
    return out
nMatch = plotOkc()
It is clear from the above figure that there are a large number of really bright counties that do not have very many sampled points. To keep these counties from dominating the comparison, I will require that there are at least ten samples, and look at the fraction of people above a 90% match. Using any match percentage limit above 75-80% gives roughly similar results.
I do not know what is going on with the population spikes at 10% and 20%. I presume this is the OkCupid servers adding some enemy “spice” to the results.
def plotMatchDist():
    x = [p['match'] for p in people]
    setup(figsize=(14,6), subplt=(1,2,1),
          xr=[0,100], xlabel='Match Percentage', ylabel='Number')
    xbin = np.arange(0,100,2)
    pylab.hist(x, xbin, alpha=0.7, label='OkC Random Sample')
    mx = 90
    line(x=mx, label='High Match [{:0.0f}%]'.format(mx))
    legend(loc=2)

    setup(subplt=(1,2,2),
          xr=[1,300], xlog=True, xlabel='Number/County',
          yr=[1,1000], ylog=True)
    pylab.hist([10.0**x for x in nMatch.values()],
               np.logspace(0,2.3,10), alpha=0.7,
               label='Number of People/County')
    legend(loc=1)
    line(x=10)
plotMatchDist()
One interesting thing we can find in the data is the roughly linear relation between the number of sampled individuals in each county and the census population estimates. This suggests that there is not a dominant non-linear "friend" effect in the population (you join OkCupid because your friends are on it).
def plotSize():
    x = [10.0**nCensus[k] for k in nMatch]
    y = [10.0**nMatch[k] for k in nMatch]
    setup(figsize=(8,8),
          xr=[1e3,7.9e5], xlog=True, xlabel='Census',
          yr=[1.0,700], ylog=True, ylabel='OkCupid',
          )
    pylab.plot(x, y, 's', label='County Samples / Census People')
    pylab.plot([1e3,1e6], [1,1e3], 'r', lw=3,
               label='One Sample / Census person')
    legend(loc=2)
plotSize()
def plotFraction(matchlim=90):
    matches = np.array([p['match'] for p in people])
    out, extra = {}, {}
    for key in shapes:
        ii = visualize.whereShape(points, shapes[key]['shape'])
        if len(ii) > 10:
            jj = np.where(matches[ii] > matchlim)[0]
            out[key] = len(jj)/float(len(ii))*100.0
            cities = toploc([people[i] for i in ii],
                            n=3, quiet=True)
            extra[key] = dict(percent=out[key],
                              match=matches[ii],
                              number=len(ii),
                              numberhigh=len(jj),
                              cities=cities)
    visualize.plotpoly(polys, out,
                       cmap=visualize.reds, clim=[1,14],
                       clabel='Percent Match 90%+')
    return extra
nFraction = plotFraction()
tmp = [nf for nf in nFraction.values() if nf['number'] > 40]
print 'Num Percent Three Largest Cities'
for t in top(tmp, 'percent', n=20, quiet=True):
    cities = ', '.join([c[0] for c in t['cities']])
    print u'{0:3.0f} {1:8.1f} {2}'.format(t['number'],
                                          t['percent'],
                                          cities)
Clustering
Rather than being stuck with county lines, let's switch to clustering. Here I use scikit-learn's MeanShift to group a number of different cities together by geographical location. It might also be interesting to add the match percentages into the algorithm as a better estimator of the different populations in each geographical region.
For now I am sticking to pretty large groups (quantile=0.01). Decreasing this number divides the regions up into smaller, but generally more interesting, groups that I will investigate later. I had hoped that contouring the regions would give mega-regions of high interest, but the high spatial frequency of cities is not captured well by this visualization technique.
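Since cluster.plot() below consumes peo, X, labels, and centers that were computed off-screen, here is a minimal sketch of what that clustering step presumably looks like with scikit-learn's MeanShift; the exact feature matrix and any filtering of people into peo are assumptions.

# Sketch of the MeanShift step (assumed, since the actual call is off-screen).
from sklearn.cluster import MeanShift, estimate_bandwidth

X = points                      # (lon, lat) pairs loaded at the top of the post
bandwidth = estimate_bandwidth(X, quantile=0.01)  # large quantile => large regions
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_             # cluster id for each sampled person
centers = ms.cluster_centers_   # geographic center of each cluster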
reload(cluster)
def plotClustermap():
    visualize.setup_plot()
    nC = cluster.plot(peo, X, labels, centers, nlimit=40,
                      addhist=False, addcity=True,
                      label='Fraction of High Match (90%+)',
                      cmap=pylab.cm.YlOrRd, crange=[0,12])
    return nC
nCluster = plotClustermap()
print ' Percent Num Top Three Largest Metropolitan Areas'
for n in nCluster:
    if n[0] > 7:
        print(u'{0[0]:8.2f} {0[1]:4d} {0[2]}'.format(n))
Differences in Cities
Using the match percentage distributions, we can see how different each city's population is from the others.
dist = cluster.plotdist(peo, labels,
                        cmap=pylab.cm.YlOrRd,
                        crange=[0,12])
Calculate the two-sample KS statistic to quantify how likely it is that the differences between the distributions are real.
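The pairwise comparison in compare_cities() presumably reduces to SciPy's two-sample KS test; a minimal sketch, assuming dist maps each cluster to its array of match percentages:

# Sketch of a pairwise two-sample KS comparison between clusters.
from scipy.stats import ks_2samp

def ks_pvalues(dist):
    keys = sorted(dist)
    pvals = {}
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            stat, p = ks_2samp(dist[a], dist[b])
            # Small p-value: unlikely the two match distributions were
            # drawn from the same underlying population.
            pvals[(a, b)] = p
    return pvals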
cluster.compare_cities(dist,
                       label='KS Probability\n' +
                             '(Dark == likely difference in populations)')