Saturday, March 12, 2016

Hacking the Election (Data)

A student at the Coder School was creating his own election results program, to print out who won or if there was a tie. He was doing a lot of stuff manually, like entering his own data, and writing lots of conditionals for "if Candidate_A > Candidate_B and Candidate_A > Candidate_C" and so on. I figured Python could offer us a way to get real data and visualize it in a graph.

Python has a built-in module for reading csv ("comma separated values") files you find on the web. I found a .csv file of the results of the Virginia Republican Primary, held less than two weeks ago. I downloaded it and opened it in a spreadsheet program, but it had a lot of info I wasn't interested in:

This wasn't even the whole sheet, which contained 22 columns, from "Election Name" to "District Type" to lots of crazy "IDs." I was only interested in the "Last Name" and "Total Votes" columns, D and F.

And instead of just a dozen or so rows of the total votes each candidate received in Virginia, I had 38,000 rows of all the totals for each candidate from every county in Virginia! Clearly some pruning was needed. I saved the file as "results.csv" and fired up Python to sort things out.

I started by adapting the "read_csv" function from Chapter 3 of Amit Saha's brilliant book Doing Math With Python. The csv module has a "reader" function which loops over the rows of the file. This code will open the file and start an iterator called "rows":

import csv

def read_csv(filename):
    '''Returns the results in a dictionary'''
    with open(filename) as f:
        rows = csv.reader(f)
        next(rows)  # skip the header row

So now I can go through all the rows and add up everybody's votes. I'll put the 4 top candidates (Trump, Cruz, Rubio and Kasich) into a dictionary called "results" and start their totals at 0:

results = {'Trump':0, 'Rubio': 0, 'Cruz':0,'Kasich':0}

Now the program will loop over the rows and check if column D is one of those four names. If so, it'll add the number in column F to his total. The indices start at 0, so column D is the item in "row" with index 3, or "row[3]" and column F is "row[5]."

for row in rows:
    if row[3] in results:
        results[row[3]] += int(row[5])

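That should work perfectly, but instead it gives me a ValueError. That's the error int() raises when it's handed something that isn't a plain whole number, so most likely one of the "Total Votes" cells was empty or oddly formatted. Let's just have it skip the row and continue if it hits a snag like that: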

for row in rows:
    try:
        if row[3] in results:
            results[row[3]] += int(row[5])
    except ValueError:
        continue
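
If counting columns by hand feels error-prone, the csv module also has a DictReader class that looks values up by header name instead of index. Here's a minimal sketch of the same tally done that way. It assumes the header cells read exactly "Last Name" and "Total Votes"; the function name tally_votes and the tiny in-memory sample are made up for illustration and are not the real Virginia data.

```python
import csv
import io

def tally_votes(lines, candidates):
    '''Sum the "Total Votes" column for each name in candidates.'''
    results = {name: 0 for name in candidates}
    # DictReader treats the first row as the header and uses it for keys
    for row in csv.DictReader(lines):
        name = row.get('Last Name')
        if name not in results:
            continue
        try:
            results[name] += int(row['Total Votes'])
        except (TypeError, ValueError):
            # empty, missing or non-numeric vote cell: skip this row
            continue
    return results

# Made-up sample rows standing in for results.csv (illustrative numbers only)
sample = io.StringIO(
    "Last Name,Locality,Total Votes\n"
    "Trump,County A,10\n"
    "Rubio,County A,7\n"
    "Trump,County B,5\n"
    "Rubio,County B,\n"
    "Carson,County B,3\n"
)

print(tally_votes(sample, ['Trump', 'Rubio']))
# prints {'Trump': 15, 'Rubio': 7}
```

To try it on the real file, you'd pass an open file object instead of the sample, for example inside "with open('results.csv') as f:".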

Now we can print the results!

print(results)

And at the bottom of the file, call the "read_csv" function when the program is run:

if __name__ == '__main__':
    read_csv('results.csv')

These are the totals for the top 4 Republican candidates:

{'Kasich': 97791, 'Rubio': 327936, 'Trump': 356896, 'Cruz': 171162}

There was no mistake: those totals match the official results on http://results.elections.virginia.gov/. After printing the results we could add some code to print the winner. First we find the "key," in this case the name, associated with the maximum value. Then we print it.

winner = max(results, key=results.get)
print('The winner is {0}, with {1} votes.'.format(winner, results[winner]))

Now it adds the winner line:

The winner is Trump, with 356896 votes.
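
The student's original program also had to say when there was a tie, which max() alone won't reveal because it returns only one key. A minimal sketch of one way to handle that, without the long chain of conditionals: find the top vote count, then collect every name that has it. The helper names find_winners and announce are hypothetical, and the three-way example at the end uses made-up numbers.

```python
def find_winners(results):
    '''Return (top vote count, sorted list of names with that count).'''
    top = max(results.values())
    winners = sorted(name for name, votes in results.items() if votes == top)
    return top, winners

def announce(results):
    '''Build the winner line, or a tie line if several names share the top count.'''
    top, winners = find_winners(results)
    if len(winners) == 1:
        return 'The winner is {0}, with {1} votes.'.format(winners[0], top)
    return 'There is a tie between {0}, with {1} votes each.'.format(
        ' and '.join(winners), top)

# Totals printed earlier in this post
print(announce({'Kasich': 97791, 'Rubio': 327936,
                'Trump': 356896, 'Cruz': 171162}))

# Made-up numbers to show the tie branch
print(announce({'A': 3, 'B': 3, 'C': 1}))
```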

To make a nice bar chart out of the numbers, add this code:
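
The sketch below is one way to draw such a chart, assuming matplotlib is installed. It is not necessarily the code used for the chart in this post: the function name plot_results, the output filename and the labels are my own choices, and the totals are the ones printed above.

```python
import matplotlib
matplotlib.use('Agg')  # draw to a file; remove this line to use plt.show() instead
import matplotlib.pyplot as plt

def plot_results(results):
    '''Draw one bar per candidate, tallest first, and return the figure and axes.'''
    names = sorted(results, key=results.get, reverse=True)
    votes = [results[name] for name in names]
    fig, ax = plt.subplots()
    ax.bar(names, votes)
    ax.set_ylabel('Total votes')
    ax.set_title('Virginia Republican Primary, 2016')
    return fig, ax

results = {'Kasich': 97791, 'Rubio': 327936, 'Trump': 356896, 'Cruz': 171162}
fig, ax = plot_results(results)
fig.savefig('results.png')  # hypothetical output filename
```

Because the function takes whatever dictionary it's given, adding more candidates to "results" adds more bars without changing the plotting code.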




And here's the chart:


Did I leave somebody out? If you want to add a candidate, just add them to the dictionary like this:

results = {'Trump':0, 'Rubio': 0, 'Cruz':0, 'Kasich':0, 
           'Carson':0, 'Bush':0}

Run the program and you'll get an updated chart:
(It looks a little different because I ran it in an IPython notebook.)

A great introduction to using Python for data analysis and very timely, too!

Update 3/13/16:
It's even easier to get the totals using a data analysis library like pandas:
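
A minimal sketch of the pandas approach, assuming pandas is installed and that the header cells read exactly "Last Name" and "Total Votes". The in-memory sample is made up so the snippet runs on its own; with the real file you would call pd.read_csv('results.csv') instead.

```python
import io
import pandas as pd

# Made-up sample rows standing in for results.csv (illustrative numbers only)
sample = io.StringIO(
    "Last Name,Locality,Total Votes\n"
    "Trump,County A,10\n"
    "Rubio,County A,7\n"
    "Trump,County B,5\n"
    "Rubio,County B,9\n"
    "Cruz,County B,\n"
)

df = pd.read_csv(sample)

# Turn anything non-numeric into NaN, then count it as zero votes
votes = pd.to_numeric(df['Total Votes'], errors='coerce').fillna(0)

# Add up the votes for each last name
totals = votes.groupby(df['Last Name']).sum().astype(int)

print(totals.sort_values(ascending=False))
```

The groupby call does the work of the loop and the dictionary in one step, and it totals every candidate in the file rather than only the names listed in advance.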
