tdroza

Playing with Python

For something I was playing around with at work, I wanted to be able to retrieve an RSS feed, parse it and post the title/description fields to another website at timed intervals. These days I only really write in Java and JavaScript, but Java seemed like such a longhand way to achieve this. I probably could have written a shell script, but it’s been such a long time since I wrote shell scripts that I’d have been starting from scratch, so I decided to take a look at Python… and so far I’m impressed. Very impressed.

From start to finish this probably only took a couple of hours, and that includes referring back to the API docs for almost every line I wrote. The code below fetches a feed and extracts the title field from the items. Each time it runs, it looks for the first item whose guid isn’t already in a text file, adds that guid to the file so that previously processed items are ignored, and then schedules itself to run again ten seconds later. I’m sure this can be tidied up lots, but for a first attempt I’m pretty happy (for simplicity I’ve removed the code that posts the items).


import urllib
from threading import Timer
from xml.dom import minidom

def retrieveXml(url):
    #get the feed
    f = urllib.urlopen(url)
    xmldoc = minidom.parse(f)
    f.close()

    # read the history (assumes the file already exists)
    historyFile = open('./history.dat','r')
    history = historyFile.read()
    historyFile.close()
    found = False
    # iterate through each item in the feed
    items = xmldoc.getElementsByTagName('item')
    for item in items:
        title = item.getElementsByTagName('title')[0].firstChild.toxml()
        guid = item.getElementsByTagName('guid')[0].firstChild.toxml()
        # if the current item isn't in the history, then use that
        if history.find(guid) < 0:
            found = True
            break
    # if we found a new entry while iterating over the feed...
    if found:
        historyFile = open('./history.dat','a')
        historyFile.write(guid + "\n")
        historyFile.close()
        # for now just print the title to screen
        print title
    t = Timer(10.0, retrieveXml, [url])
    t.start()

retrieveXml('http://f1.gpupdate.net/en/xml/rss/1.xml')
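
The listing above is Python 2 (urllib.urlopen and the print statement). As a rough sketch only, here is how the same parse-and-remember step might look in Python 3. Everything named here (SAMPLE_FEED, text_of, load_history, next_new_item, record_guid) is made up for illustration, and the inline feed stands in for the downloaded one. It also swaps the substring search on the history text for an exact match against a set of guids, and treats a missing history file as empty.

```python
from xml.dom import minidom

# Hypothetical sample feed, standing in for the downloaded RSS document.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>First story</title><guid>guid-1</guid></item>
<item><title>Second story</title><guid>guid-2</guid></item>
</channel></rss>"""


def text_of(item, tag):
    """Return the text of the first <tag> inside item, or None if absent or empty."""
    nodes = item.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    # .data is the unescaped text; toxml() would return the serialized form.
    return nodes[0].firstChild.data


def load_history(path):
    """Read previously seen guids, one per line; a missing file means no history."""
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()


def next_new_item(feed_xml, history):
    """Return (guid, title) for the first item not yet in history, else None."""
    doc = minidom.parseString(feed_xml)
    for item in doc.getElementsByTagName("item"):
        guid = text_of(item, "guid")
        if guid is not None and guid not in history:
            return guid, text_of(item, "title")
    return None


def record_guid(path, guid):
    """Append a processed guid to the history file, one guid per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(guid + "\n")
```

Each run would then call load_history('./history.dat'), pass the result to next_new_item along with the fetched feed text, and call record_guid for whatever comes back. The exact match means a guid that happens to be a prefix of an older one is not skipped by mistake.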