Wednesday, May 26, 2010



A few weeks back, the Twitter Census datasets were released by Infochimps. The collection comprises several datasets drawn from a scrape of Twitter's 40 million users.

I downloaded the Twitter Users by Location dataset to explore. I first unzipped the file, added a .csv file extension to it, and opened it in Excel to see what the data looked like. It is basically a long column of entries from the location field of Twitter users' profiles. There are 3.6 million rows in this dataset, so Excel wasn't quite capable of doing the work. I switched to a terminal and used cat to look around. While letting cat stream the data up the screen, I saw two large blocks of clean coordinates. One block was from users whose iPhones put coordinates in the location field in Twitter; the other block must have been another phone type (BB?) doing the same, as both had consistent characters prefixing the geocoding. I used the prefix to match and extract those lines into a separate file, loaded them into a mysql table, ran some delete commands to remove invalid coordinates, and ended up with over 500,000 points to map. You know the rest: I connected to the db with Tableau and watched the map render. Some of the maps are in a Picasa album along with some more abstract images from the map.
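In mysql terms, the load-and-clean step boils down to something like the sketch below. The table name, column names, and file path are just placeholders.

    -- Placeholder table for the extracted coordinate lines
    CREATE TABLE twitter_points (
        id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        lat DECIMAL(9,6),
        lon DECIMAL(9,6)
    );

    -- Load the extracted csv (placeholder path)
    LOAD DATA LOCAL INFILE 'twitter_coords.csv'
    INTO TABLE twitter_points
    FIELDS TERMINATED BY ','
    (lat, lon);

    -- Delete rows with missing or out-of-range coordinates
    DELETE FROM twitter_points
    WHERE lat IS NULL OR lon IS NULL
       OR lat NOT BETWEEN -90 AND 90
       OR lon NOT BETWEEN -180 AND 180;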


To do these, I just set the map layer washout to 100% and started zooming in on different areas. The image above is Atlanta, Ga. I liked what I saw but wanted to add more color. The database this is pulling from is just a table with 3 columns: a unique id, latitude, and longitude, so there was nothing there to use as a dimension to apply color to. So, I created some new columns and used the rand() function to populate the first column with random numbers between 1 and 5, and the next column 1-10. These random numbers were then used as the color in Tableau. Below are a couple of results: the first is the eastern US with 5 colors and no other changes, the second is Atlanta with 10 colors, open circles for the markers, and increased transparency. More of these are in the Picasa album, and more will be added there.
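The random color columns are easy to add in mysql. A rough sketch, using the same placeholder table as above:

    -- Add two color columns and fill them with random integers
    ALTER TABLE twitter_points
        ADD COLUMN color5  TINYINT,
        ADD COLUMN color10 TINYINT;

    UPDATE twitter_points
    SET color5  = FLOOR(1 + RAND() * 5),   -- random integer 1-5
        color10 = FLOOR(1 + RAND() * 10);  -- random integer 1-10

Tableau then just treats color5 or color10 as a dimension to color by.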







Again, these images represent a set of 500,000 locations extracted from a larger set of 3.6 million. For mapping purposes, I think the remaining 3 million would just show more of the same, but they would certainly fill in some blanks. (For instance, no coordinates were in there for North Korea, but a text search revealed a couple of dozen hits for North Korea.) For the abstract images, however, I think more would be better. I am working on a few simple searches and then some regular expressions to sift through the rest and pull out things that can be mapped, including addresses and other coordinate sets. The big challenge will be mapping the ones that only have a city name: for instance, Atlanta has around 10,000 points now from coordinates, but a search for Atlanta reveals at least 10,000 more by name. My plan is to take those returns and use the rand() function, or something else, to randomly generate coordinates within the area of interest and see what happens. Hopefully a purely aesthetic cartography.
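For the name-only matches, the random-placement idea could look something like the sketch below. The bounding box numbers for Atlanta are rough guesses, and the twitter_locations table of raw location strings is a placeholder.

    -- Scatter name-only "Atlanta" matches randomly inside an
    -- approximate bounding box (rough coordinates)
    INSERT INTO twitter_points (lat, lon)
    SELECT 33.65 + RAND() * 0.25,    -- latitude roughly 33.65 to 33.90
           -84.55 + RAND() * 0.30    -- longitude roughly -84.55 to -84.25
    FROM twitter_locations
    WHERE location LIKE '%atlanta%';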

Monday, May 17, 2010

Poetry Concat()

In the last post, I talked about reshaping the Stand categories to change them from wide to long data. The next thing I did was to concatenate, or reassemble, the responses. The four questions were divided into up to five sections for categorization. In order to easily browse the responses, I put them back together. I used Excel to simply merge the columns and loaded the resulting file into mysql. To keep the sections neatly separated, I could have inserted a column between them with a special character or string that would mark the change or be replaced by a space or some other delimiter. I didn't. The result is much more interesting, as evidenced by a few selections below. The csv file of the concatenated responses can be downloaded and combined with the one for the categories. Two more tables and you will have the basic database I use for ad hoc queries. Woo-hoo!
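The same merge could also be done after loading, directly in mysql with concat_ws(), which joins the pieces with a space and skips NULL sections. A rough sketch with placeholder table and column names:

    -- Reassemble the five sections of one question into a single string
    SELECT case_number,
           CONCAT_WS(' ',
                     section_1, section_2, section_3,
                     section_4, section_5) AS full_response
    FROM stand_responses;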

On to the poetry concat()....

see above healthier population see #2
walkable downtown mountains
outdoors affordable
downtown weather
future potential lifestyle values
see above no rich tards running it no snow see #2
aquarium fiber optic lines
downtown river bridges boats grass
see #1 plus direct air flight to NOLA see #2
drugs that would take care of crime
recycle not litter carpool
visitor - not sure
clean more well restored buildings
see above see #2
clean drug and gang free
i just did
? plant stuff
historic clean safe religious
traffic traffic traffic traffic traffic
education taxes litter pollution
crime education apathy
see above see #2

Ok, those are just a few combinations of words that I liked. I found them in a few minutes of scanning the first few hundred responses, found a theme, ran some queries, and went with it. Many, many more and better ones are in there. More gems to come.

Note: If the main download link doesn't work, try the one just above it to the left.



Monday, May 10, 2010

Stand cat reshape

The first thing I did when I downloaded the Stand results was to load the csv file into Tableau. It was able to do some nice basic mapping, but when it came to graphing and mapping categories it wasn't happy. The categories are assigned to the responses in pieces. (A brief data dictionary might be in order.) To the best of my knowledge, the 4 Stand questions were broken into 5 possible responses each. This was derived from the fact that the paper survey, which most of the results came in on, had 5 lines under each question. So some people filled it out as bullet points, 1-2-3-4-5, while others used the lines for a narrative response. Either way, each question was broken into 5 parts, and each part could receive 3 category designations. Four questions with 5 parts and 3 categories each gives the potential of 60 category designations per respondent. This is wide data: many columns per row.

Whether due to the way Tableau works or my ignorance of it, I couldn't get the wide data to work when trying to do category analysis per response record or per category designation. I needed to take the wide data and make it long data, with one row per respondent per category designation. Excel was the savior in this. I downloaded an extension for Excel that did just that and reshaped the data to create one row for each designation. Five minutes later I had a spreadsheet with case numbers and the categories their responses fit into. Luckily, not all of the responses were 5 lines with 3 categories, otherwise it would have gone beyond Excel's limits. It turned out to be 273,000+ rows. I used this sheet for one of my four final database tables for looking at Stand data. I will outline and share the others in upcoming posts. While I do have 2 primary dbs of Stand data, I mostly use the Stand web interface for quick queries and preliminary scanning.
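For anyone who would rather skip the Excel extension, the same wide-to-long reshape can be sketched in mysql as a stack of UNION ALL selects, one per category column. The table and column names below are placeholders, and only a few of the 60 possible columns are shown.

    -- One row per respondent per non-empty category designation
    SELECT case_number, 'q1_part1_cat1' AS source, q1_part1_cat1 AS category
    FROM stand_wide WHERE q1_part1_cat1 IS NOT NULL
    UNION ALL
    SELECT case_number, 'q1_part1_cat2', q1_part1_cat2
    FROM stand_wide WHERE q1_part1_cat2 IS NOT NULL
    UNION ALL
    SELECT case_number, 'q1_part2_cat1', q1_part2_cat1
    FROM stand_wide WHERE q1_part2_cat1 IS NOT NULL;
    -- ...and so on for the remaining columns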

Note: My intention was to put up an interactive visual Stand exploration tool on this blog. It might still happen as I imagined; however, the free Tableau Public tool has a row limit, which the category sheet above more than surpasses. I hope to still create a map interface that then displays results from the web interface via its API.

Wednesday, May 5, 2010

2010 Shooting Map Update


This map updates the previous one, so it covers an additional 3 weeks. I will post the same data by zip code later. If interested, you can then look at Stand results under the Crime subcategories and see how they mesh with these maps. An example is the subcategory for Violence.
