
Information Freedom Fighting


My eye was caught by the City of York Council announcing that they publish all their "Freedom of Information" requests as PDFs (here).

The sharp-eyed amongst you will spot that the requests are organised by week, with each week's requests stored in a single PDF. To read one you have to click through to the right week, then through to that page, then download the PDF separately (using the handy "Download Now" link) and then read it. The search engine is pretty hopeless too: it can't restrict itself to FOI requests, so any query returns hundreds of results.

Organising FOI requests by week is completely ridiculous, almost as ridiculous as ordering them by the number of words used, or alphabetically. Of course it probably makes sense from the point of view of compiling the requests - it sounds like a "once a week" job for somebody - but to publish the archive organised that way seems madness.

One of my pet hates is information that is made available but totally impossible to use. It's like saying, "Yes, of course you can have all the data we keep on you, we have written it in mist on these eggshells - would you like us to post it to you?". It's exactly like that.

So, one evening, I wondered if I could retrospectively do something more useful with their data. It is open, so why not?

First, I made a crawler with ScraperWiki (what an excellent tool this is) that follows all those links and grabs the text from the PDFs. I based this on Martin's Scraper (thanks Martin).
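For the curious, the gist of the crawl looks something like the sketch below. It isn't the actual ScraperWiki scraper - it's a minimal sketch assuming the requests, BeautifulSoup and pypdf libraries are installed, and the listing URL and the ".pdf" link test are made-up placeholders rather than the council's real page structure.

# Sketch of the crawl: find the PDF links on a listing page and pull the text
# out of each one. Not the real ScraperWiki scraper - the listing URL and the
# ".pdf" link test are illustrative assumptions.
import csv
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

LISTING_URL = "https://example.org/foi-disclosure-log"  # hypothetical

def pdf_links(listing_url):
    # yield absolute URLs of PDFs linked from the listing page
    soup = BeautifulSoup(requests.get(listing_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].lower().endswith(".pdf"):
            yield urljoin(listing_url, a["href"])

def pdf_text(url):
    # download a PDF and return its extracted text as one string
    reader = PdfReader(BytesIO(requests.get(url).content))
    return " ".join(page.extract_text() or "" for page in reader.pages)

with open("pdfextractor_1.csv", "w") as f:
    writer = csv.writer(f)
    for url in pdf_links(LISTING_URL):
        # one row per PDF: url, text (commas and newlines flattened so a
        # naive one-line-per-PDF split(",") later still works)
        text = pdf_text(url).replace(",", " ").replace("\n", " ")
        writer.writerow([url, text])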

I then downloaded the collected data as CSV and tried using the free statistical tool Sci2 for stemming and combining the words. Whilst this approach worked, I didn't like the resulting stemmed words, like "glaz", because they just look so unfriendly.
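For comparison, here is roughly what the stemming step looks like if you do it yourself with NLTK's Porter stemmer - a sketch of the general technique, not what Sci2 does internally, and it assumes nltk is installed.

# Sketch of the stemming approach using NLTK's Porter stemmer. The stems it
# produces are often truncated, non-dictionary forms, which is exactly why I
# went off this route.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["glazing", "housing", "requests", "councils"]:
    print(word, "->", stemmer.stem(word))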

Next, I wrote a Python script to strip out stopwords like "the" and "where", count the popularity of each word in the downloaded file while keeping track of the URLs it came from, and save it all back to a new .csv file.

import re, HTMLParser

# CSV produced by the ScraperWiki PDF extractor: one line per PDF, "url,text"
CSV_FILE = '/Users/tomsmith/Downloads/pdfextractor_1.csv'

def get_urls(word):
    # return the URLs of every PDF whose text contains the given word
    urls = []
    for line in open(CSV_FILE).readlines():
        if "," not in line:
            continue
        url = line.split(",")[0].strip()
        text = line.split(",")[1].strip().lower()
        if word in text.split(" "):
            urls.append(url)
    return urls

class MLStripper(HTMLParser.HTMLParser):
    # collects the text content of any HTML fed to it
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_fed_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    # Warning: this strips all tags, including script and style, so their
    # contents come through as text
    x = MLStripper()
    x.feed(html)
    return x.get_fed_data()

def match(s, reg):
    p = re.compile(reg, re.IGNORECASE | re.DOTALL)
    return p.findall(s)

stopwords = set(open('stopwords-en.txt').read().split())
d = {}

for line in open(CSV_FILE).readlines():
    if "," not in line:
        continue
    url = line.split(",")[0].strip()
    text = line.split(",")[1].strip()
    for word in text.split(" "):
        word = word.lower()
        # keep only the leading run of letters (drops punctuation and digits)
        word = match(word, "[a-z]*")[0]

        # throw away anything purely numeric
        try:
            float(word)
            word = ''
        except ValueError:
            pass

        if len(word) > 1 and word not in stopwords:
            d[word] = d.get(word, 0) + 1

# sort by frequency, most popular first
finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)

out = open("tagcloud.csv", 'w')
out.write("word,frequency,urls\n")
for word, count in finalFreq:
    urls_str = "|".join(get_urls(word))  # the PDFs this word appears in
    out.write(word + "," + str(count) + "," + urls_str + "\n")
    print word, count, urls_str
out.close()

I then uploaded it as a Google Spreadsheet and turned it into a Web Application.
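If you wanted to script the spreadsheet step rather than uploading by hand, something like this would do - a sketch assuming the gspread library and a Google service account, with made-up file and spreadsheet names, not what I actually ran.

# Sketch only: pushing tagcloud.csv into a Google Sheet with gspread. Assumes
# a service-account JSON key and a spreadsheet already shared with that account.
import csv
import gspread

gc = gspread.service_account(filename="service-account.json")  # hypothetical key file
worksheet = gc.open("FOI tag cloud").sheet1                    # hypothetical sheet name

for row in csv.reader(open("tagcloud.csv")):
    worksheet.append_row(row)  # slow but simple: one API call per row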

Problems along the way


I found that displaying 1,000 words was a bit of a struggle for jQuery (maybe I made it wrong), so I ended up just showing 250 at a time.
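A data-side alternative would have been to cap the CSV itself at the top 250 words before it ever reaches the browser - a sketch of that idea, not what the web app actually does.

# Sketch: keep only the 250 most frequent words so the page never has to
# render a thousand of them.
import csv

rows = list(csv.reader(open("tagcloud.csv")))
header, body = rows[0], rows[1:]
body.sort(key=lambda r: int(r[1]), reverse=True)  # sort by the frequency column

with open("tagcloud_top250.csv", "w") as f:
    csv.writer(f).writerows([header] + body[:250])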

I found, of course, that tag-clouding the words by popularity only revealed that the most popular words were fairly meaningless in this context, like "foi", "february" and "council". To do this properly you also need a list of contextual stopwords... some human intervention.
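That intervention doesn't need to be clever: a second, hand-curated stopword list for the domain, applied alongside the generic English one, is enough. A sketch, with the extra words being just the obvious offenders from my own results.

# Sketch: a contextual stopword list on top of the generic English stopwords.
stopwords = set(open('stopwords-en.txt').read().split())
contextual = {"foi", "freedom", "information", "request", "requests",
              "council", "york", "january", "february"}  # curated by hand

def keep(word):
    # True if a word should make it into the tag cloud
    return len(word) > 1 and word not in stopwords and word not in contextual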

Another glitch was that occasionally an FOI request would consist of massive tables of suppliers (in a PDF, remember), which would skew the words towards "B&Q", "Wickes", "building materials" and so on. I had to avoid those.
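One cheap way to damp that skew, rather than weeding those PDFs out by hand, is to count each word at most once per document, so a single giant supplier table can only contribute one to each word's score. A sketch of the idea, not what the script above actually does.

# Sketch: count document frequency instead of raw frequency, so one PDF full
# of supplier names only adds 1 per word rather than hundreds.
d = {}
for line in open('/Users/tomsmith/Downloads/pdfextractor_1.csv').readlines():
    if "," not in line:
        continue
    text = line.split(",")[1].strip().lower()
    for word in set(text.split(" ")):  # set(): each word counted once per PDF
        d[word] = d.get(word, 0) + 1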


The result


And here it is: a list of words that you can browse, with direct links to the PDFs they came from. In the end, it seems that showing 250-ish words is easier on both the technology and the eye. The result is something you could maybe browse to find a link to something relevant to you.



https://script.google.com/macros/s/AKfycbxAFOZjPBNnpg60MHzFQQjb2TkTsFUSDP_oPrRdNJDg83e3eCc/exec



