IBM Data Science Professional Certificate Capstone

Introduction

When I first connected to the Foursquare API, I did a test search around my current location (Portsmouth, VA) and found only a dismal selection of venues to look at. For my capstone project, I thought it might be interesting to see if I could use the Foursquare API to locate other Virginia cities with a similar Foursquare venue profile and find out what characteristics these cities have in common.

I think this information might be valuable to providers of APIs like Foursquare, because it would allow them to enact targeted plans to engage excluded communities and increase coverage of their app.

Data Sources

  • List of Virginia Cities from Wikipedia
  • Geolocation data scraped from Geonames.org
  • Foursquare API

Methodology

I started by web-scraping a table of Virginia cities from Wikipedia. Next, I used the list of Virginia cities to look up corresponding zipcodes and geographic coordinates from Geonames.org. Once I had compiled a list of cities and zipcodes, I used the Foursquare API to search for venues near each location. Next, I had to clean up the data to remove duplicates of venues that were within the search radius of more than one city.
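The de-duplication step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes each search result is a dict with hypothetical keys `venue_id`, `city`, `lat`, and `lon`, and it resolves duplicates by assigning a venue to the closest city that returned it, which is only one of several reasonable rules. The coordinates are approximate and for illustration only.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def dedupe_venues(rows, city_coords):
    """Keep each venue once, assigned to the closest city that returned it."""
    best = {}  # venue_id -> (distance to city, row)
    for row in rows:
        city_lat, city_lon = city_coords[row["city"]]
        dist = haversine_km(city_lat, city_lon, row["lat"], row["lon"])
        current = best.get(row["venue_id"])
        if current is None or dist < current[0]:
            best[row["venue_id"]] = (dist, row)
    return [row for _, row in best.values()]


# Hypothetical sample: venue "v1" falls inside the search radius of two cities.
city_coords = {"Portsmouth": (36.8354, -76.2983), "Norfolk": (36.8508, -76.2859)}
rows = [
    {"venue_id": "v1", "city": "Portsmouth", "lat": 36.8360, "lon": -76.2990},
    {"venue_id": "v1", "city": "Norfolk", "lat": 36.8360, "lon": -76.2990},
    {"venue_id": "v2", "city": "Norfolk", "lat": 36.8500, "lon": -76.2850},
]
unique = dedupe_venues(rows, city_coords)
```

Keying on the venue ID rather than the venue name avoids merging different branches of the same chain.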

I started with a high-level analysis using only the “Foursquare venue density” of each city, which I divided into quartiles to visualize the results. This revealed some obvious trends related to population density, so I decided to break the data down further by running a k-means clustering algorithm on the venue categories.
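As a rough sketch of the quartile step, the snippet below assigns each city a quartile number from its venue density using only the standard library. The function name, the city names, and the density values are all hypothetical, and the choice of the "inclusive" quantile method and of placing ties in the lower quartile are assumptions rather than the notebook's actual settings.

```python
import bisect
import statistics


def quartile_labels(densities):
    """Map each city to a quartile number (1-4) based on its venue density."""
    # Three cut points split the sorted densities into four groups.
    cuts = statistics.quantiles(list(densities.values()), n=4, method="inclusive")
    # A density equal to a cut point falls in the lower quartile.
    return {city: bisect.bisect_left(cuts, d) + 1 for city, d in densities.items()}


# Hypothetical densities (venues per zipcode), for illustration only.
sample = {"City A": 0.0, "City B": 1.0, "City C": 2.0, "City D": 3.0, "City E": 4.0}
labels = quartile_labels(sample)
```

When many cities share a density of zero, they all land in the first quartile, so the four groups will not necessarily be equal in size.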

Results

My initial cursory analysis looked only at the venue density near each city. A histogram of the venue density revealed that Portsmouth actually had a much higher venue density than I had originally thought (9.125 venues/zipcode). A significant number of Virginia cities had no Foursquare venues at all.

Number of cities by approximate Foursquare venue density

To get a rough idea of where these low- and high-density areas were, I color-coded each city by quartile.

VA Cities by Foursquare Density Quartile

To get a better idea of what venues were present in each location, I used a k-means clustering algorithm to identify cities with similar types of venues. This algorithm produced the following clusters:
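For readers unfamiliar with the technique, here is a minimal sketch of clustering cities by venue category. It is not the notebook's code: it uses a small standard-library implementation of Lloyd's algorithm with a deterministic farthest-point seeding (an assumption made so the example is reproducible), where a library implementation such as scikit-learn's `KMeans` would normally be used. The city names and venue lists are hypothetical.

```python
from collections import Counter


def category_profile(categories, vocabulary):
    """Relative frequency of each venue category for one city."""
    counts = Counter(categories)
    total = len(categories)
    # A city with no venues gets an all-zero profile.
    return [counts[c] / total if total else 0.0 for c in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumption, not k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: returns a cluster index for every point."""
    centroids = init_centroids(points, k)
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical venue categories per city, for illustration only.
city_venues = {
    "Trail Town": ["Trail", "Trail", "Sandwich Place"],
    "Hiker Stop": ["Trail", "Sandwich Place"],
    "Beachside": ["Beach", "Beach"],
    "Shoreline": ["Beach"],
    "Empty Town": [],
}
vocabulary = sorted({c for cats in city_venues.values() for c in cats})
profiles = [category_profile(cats, vocabulary) for cats in city_venues.values()]
clusters = dict(zip(city_venues, kmeans(profiles, k=3)))
```

Because cities with no venues all share the same all-zero profile, they naturally end up in a cluster of their own, which matches the behavior described for Cluster #5.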

Cluster #1 contained the cities of Chester Gap and Ferrum. This cluster scored high in venues identified by ‘Sandwich Place’ and ‘Trail’.

Cluster #2 contained the cities of Penhook, Greenway, Great Falls, West Mclean, and Culpeper. This cluster scored high in venues identified by ‘Garden Center’, ‘Home Service’, and ‘Park’.

Cluster #3 contained only the city of Fort Monroe and was identified based on ‘Beach’ venues.

Cluster #4 contained the majority of Virginia cities. This includes: Marshall, Suffolk, Emporia, Elliston, Colonial Beach, Colonial Heights, Martinsville, Vienna, Burke, Centreville, Hampton, Danville, Newport News, Virginia State University, Rocky Mount, Williamsburg, Reston, Poquoson, Mount Vernon, Vinton, Lorton, Lexington, Dunn Loring, Radford, Fairfax, Salem, Waynesboro, Norton, Buena Vista, Lynchburg, Staunton, Chesapeake, Annandale, Henrico, Winchester, Manassas, Fairfax Station, Virginia Beach, Roanoke, Covington, Portsmouth, Fredericksburg, Newington, Fort Eustis, Herndon, Chantilly, Hopewell, Springfield, Charlottesville, Bristol, Fort Belvoir, Falls Church, Richmond, Clifton, Norfolk, Alexandria, Harrisonburg, Mc Lean, and Merrifield. These cities were characterized by having a high diversity of Foursquare venue types.

Cluster #5 contained Virginia cities with no Foursquare venues at all. This includes: Wirtz, Warsaw, Franklin, Galax, Glade Hill, Haynesville, Henry, Village, Farnham, Oakton, Catawba, Callaway, Randolph, Redwood, Sharps, Boones Mill, Bent Mountain, Union Hall, Petersburg, and Waterford.

These clusters produced the following map:

Virginia Cities by Foursquare Venue Type Cluster

The source code used in this analysis can be found on GitHub:

https://github.com/rruff82/Coursera_Capstone/blob/master/Coursera%20Capstone.ipynb

Discussion

This map showed several similarities with the venue density map, implying that the number of venues was a very strong factor in the clustering. Foursquare seems to have a very limited venue selection outside of major cities and urban areas.

The rural cities primarily had venues related to natural attractions (parks, trails, and beaches). This pattern suggests that Foursquare might be able to improve its coverage of Virginia cities through advertising targeted at outdoor activities. Partnering with companies like REI or Bass Pro on incentives to grow the Foursquare user base could potentially draw in users from underrepresented Virginia cities.

Conclusion

Overall, the results of this analysis were very much in line with what one would expect based on population. Cities with high populations had higher rates of Foursquare usage, and thus more venues to work with. Foursquare would need to make a substantial effort to draw in users from rural areas to increase coverage in Virginia. A more detailed analysis should control for factors like population and demographics, to rule them out as confounding factors, before attempting to identify patterns in venue types.
