Wednesday, May 26, 2010



A few weeks back, Infochimps released the Twitter Census datasets. The collection contains several datasets built from a scrape of Twitter's 40 million users.

I downloaded the Twitter Users by Location dataset to explore. I first unzipped the file, added a .csv extension to it and opened it in Excel to see what the data looked like. It is basically one long column of entries from the location field on each Twitter user's profile. There are 3.6 million rows in this dataset, so Excel wasn't quite capable of doing the work. I switched to a terminal and used cat to look around. While letting cat stream the data up the screen, I saw two large blocks of clean coordinates. One block came from users whose iPhone had put coordinates in the Twitter location field; the other block must have come from another phone type (BB?) doing the same, as each block had its own consistent characters prefixing the coordinates. I used those prefixes to match and extract the lines into a separate file, loaded them into a MySQL table, ran some delete commands to remove invalid coordinates, and ended up with over 500,000 points to map. You know the rest: I connected to the db with Tableau and watched the map render. Some of the maps are in a Picasa album, along with some more abstract images made from the map.
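
The extract-and-clean step can be sketched in a few lines of Python. This is only an illustration of the idea, not the commands actually used. The prefix string, the "lat,lon" layout after it and the function name are all assumptions, since the real prefix characters aren't listed here. The range check stands in for the delete commands that removed invalid coordinates.

```python
import re

# Hypothetical prefix: the actual characters in the dataset are not given in the post.
PREFIX = "iPhone: "

# Assumed layout after the prefix: "<latitude>,<longitude>" as plain decimals.
COORD_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$")


def extract_points(lines, prefix=PREFIX):
    """Return (lat, lon) tuples from lines that start with the given prefix."""
    points = []
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # free-text location, not a phone-generated coordinate
        match = COORD_RE.match(line[len(prefix):])
        if match is None:
            continue  # prefix present but no parsable coordinate pair
        lat, lon = float(match.group(1)), float(match.group(2))
        # Drop values outside the valid latitude/longitude ranges.
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            points.append((lat, lon))
    return points


sample = [
    "iPhone: 33.748995,-84.387982",  # kept
    "Atlanta, GA",                   # skipped: no prefix
    "iPhone: 123.0,-84.0",           # skipped: latitude out of range
]
points = extract_points(sample)
```

Running the same function once per prefix and concatenating the results would give the combined point set to load into the database.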


To make these, I just set the map layer washout to 100% and started zooming in to different areas. The image above is Atlanta, Ga. I liked what I saw but wanted to add more color. The database this is pulling from is just a table with 3 columns (a unique id, latitude and longitude), so there was nothing in it to use as a dimension for color. So I created two new columns and used the rand() function to populate the first with random numbers between 1 and 5 and the second with random numbers between 1 and 10. These random numbers were then used as the color in Tableau. Below are a couple of results: the first is the eastern US with 5 colors and no other changes; the second is Atlanta with 10 colors, open circles for the markers and increased transparency. More of these are in the Picasa album, and more will be added there.
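
For anyone who wants to try the random-color trick outside the database, here is a rough Python equivalent. It is a sketch under assumptions, not the MySQL statements used for these images: the function name and the tuple layout are made up for illustration, and the optional seed is only there to make a run repeatable.

```python
import random


def add_color_buckets(points, seed=None):
    """Attach two random color columns (1-5 and 1-10) to each (lat, lon) point."""
    rng = random.Random(seed)
    rows = []
    for lat, lon in points:
        color5 = rng.randint(1, 5)    # inclusive on both ends
        color10 = rng.randint(1, 10)  # inclusive on both ends
        rows.append((lat, lon, color5, color10))
    return rows


rows = add_color_buckets([(33.749, -84.388), (40.713, -74.006)], seed=1)
```

Because the numbers carry no meaning, the colors only break up the density of the points visually; they say nothing about the users behind them.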







Again, these images represent a set of 500,000 locations extracted from a larger set of 3.6 million. For mapping purposes, I think the remaining 3 million would mostly show more of the same, but would certainly fill in some blanks. (For instance, there were no coordinates for North Korea, but a text search revealed a couple of dozen hits for North Korea.) For the abstract images, however, I think more would be better. I am working on a few simple searches and then some regular expressions to sift through the rest and pull out things that can be mapped. These include addresses and other coordinate sets. The big challenge will be mapping the entries that just have a city name. For instance, Atlanta has around 10,000 points now from coordinates, but a search for Atlanta reveals at least 10,000 more by name. My plan is to take those results and use the rand() function, or something else, to randomly generate coordinates within the area of interest and see what happens. Hopefully the result is a purely aesthetic cartography.
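
One way that plan could work is to scatter each name-only entry uniformly inside a bounding box around the city. The Python below is a sketch of that idea only. The function name and the box corners are hypothetical, and the rough Atlanta-area values are picked for illustration rather than taken from the data.

```python
import random


def scatter_in_box(count, south, west, north, east, seed=None):
    """Generate `count` random (lat, lon) points inside a bounding box."""
    rng = random.Random(seed)
    return [
        (rng.uniform(south, north), rng.uniform(west, east))
        for _ in range(count)
    ]


# Illustrative box loosely surrounding Atlanta; the corner values are assumptions.
atlanta_points = scatter_in_box(
    10000, south=33.6, west=-84.6, north=33.9, east=-84.2, seed=0
)
```

A rectangle gives hard edges on the map. Drawing from a distribution centered on the city, for example with random.gauss, would be one alternative if a softer falloff looks better.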
