I’m very taken with the general move towards more data from primary sources: councils, government orgs etc. putting stats, facts, figures and information online for us to use and mash up. Those orgs savvy enough to push this stuff out through RSS make it even easier for us to harvest it and add an extra dimension to our news gathering.
Of course the public sector moves slowly when it comes to IT, and it’s no surprise that the majority of orgs still hide their content away on static pages. No RSS feed to help there. So what do we do?
Well, we could resign ourselves to adding them to the list of pages that we bookmark and visit. A bit like those regular calls we make to keep our contacts book fresh; no bad thing. But another solution is to use one of the many RSS services on the web to ‘scrape’ the page for content and convert it into a feed.
Preston City Council (the council nearest to me at work) has a few feeds but none around the basic operation of the council – meetings, decisions etc. This kind of thing would be great to get a feed of. So I thought I would give it a go with their published decisions page, using Feed43.
The first thing I did was set the search so that it showed all results. That way any new ones would show up by default. I did this by using an * in the search box. The * is a standard operator for a wild card or ‘any matches’. So it seemed a logical punt to try it.
The next step was to copy the web address to feed my RSS maker. The URL looks complex but it contains all the information needed to drive the search.
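To see how an address can carry everything a page needs, it can help to pull one apart. Here is a minimal Python sketch. It uses the decision link quoted further down this post rather than the search URL itself, on the assumption that the search address is built the same way, only longer.

```python
from urllib.parse import urlsplit, parse_qs

# Address copied from the decision link quoted later in this post.
url = "http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0"

parts = urlsplit(url)           # split into site, page and query string
params = parse_qs(parts.query)  # turn the query string into a dictionary

print(parts.netloc)  # the site the page lives on
print(parts.path)    # the page being asked for
print(params)        # the settings handed to that page
```

Everything after the ? is a list of name=value pairs, which is why copying the address is enough to repeat the search.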
The first step with Feed43 is to feed it the URL then click Reload. It pulls in the whole page and then you get the hard bit. The idea with feed scrapers is to give it enough information about the way the stuff you want is presented that it can ‘spot’ the stuff and ignore the rest. This means trawling through some HTML.
You get two options:
The global search pattern looks for HTML that ‘wraps’ the content you want to make into a feed. It could be the whole table that contains the search results. But this doesn’t really help in this case.
Better to go straight to the second option which defines the specific things to look for to define an item to be added to the feed. Here’s what I put.
<td > <a href="{%}" title="{*}">{%}</a></td>
In feed43 language {*} means this could be anything, just ignore it. {%} means this is important so store it.
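If you are curious how the two placeholders relate to regular expressions, the rough sketch below translates a Feed43-style pattern into a Python regex. It is only an approximation for illustration: the function name feed43_to_regex is made up, and I am assuming (not claiming) that Feed43 treats whitespace loosely and matches as little text as possible.

```python
import re

def feed43_to_regex(pattern):
    """Approximate a Feed43-style item pattern with a Python regex.

    {*} -> skip anything, {%} -> capture anything.
    Hypothetical helper; Feed43's real matching rules may differ.
    """
    out = []
    # Keep the placeholders as separate pieces when splitting.
    for part in re.split(r"(\{\*\}|\{%\})", pattern):
        if part == "{*}":
            out.append(".*?")    # ignore whatever is here
        elif part == "{%}":
            out.append("(.*?)")  # store whatever is here
        else:
            # Literal HTML: escape it, but let any run of whitespace vary.
            for chunk in re.split(r"(\s+)", part):
                out.append(r"\s*" if chunk.isspace() else re.escape(chunk))
    return re.compile("".join(out), re.DOTALL)

regex = feed43_to_regex('<a href="{%}" title="{*}">{%}</a>')
found = regex.findall('<a href="x.html" title="ignored">Hello</a>')
print(found)
```

Each {%} becomes one captured value, in the order it appears in the pattern.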
I could see from the HTML that each decision in the list looked like this:
<td > <a href="http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0" title="Link to decision details for North West England Regional Spatial Strategy Partial Review Consultation">North West England Regional Spatial Strategy Partial Review Consultation</a>
So I told Feed43 to look for anything between the <td> </td> tags regardless of what ‘class=’ said. Then I told it to grab the href link as the actual weblink, ignore the title and then grab the text between the <a> tags to use as a title.
Clicking Extract will filter the content and show you the results. You can see they are split into {%1} for the link and {%2} for the title of the decision.
The last step is to define which of these makes up the key parts of the feed. You can see it’s pretty straightforward to fill the gaps at this point. Your feed is then ready to go. All you need to do is subscribe in the normal way.
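For comparison, here is roughly the same extraction written as a hand-rolled regular expression in Python, run against the single row quoted above. This is a sketch, not what Feed43 does internally; the group names link and title are my own labels standing in for {%1} and {%2}.

```python
import re

# One row of the results table, as quoted in this post.
row = (
    '<td > <a href="http://preston.moderngov.co.uk/ieDecisionDetails.aspx'
    '?ID=348&displaypref=0" title="Link to decision details for North West '
    'England Regional Spatial Strategy Partial Review Consultation">North West '
    'England Regional Spatial Strategy Partial Review Consultation</a>'
)

item_re = re.compile(
    r'<td[^>]*>\s*'               # any <td>, whatever attributes it carries
    r'<a href="(?P<link>[^"]*)"'  # keep the address (the {%1} slot)
    r' title="[^"]*">'            # skip the title attribute
    r'(?P<title>.*?)</a>',        # keep the link text (the {%2} slot)
    re.DOTALL,
)

match = item_re.search(row)
print(match.group("link"))
print(match.group("title"))
```

The [^>]* inside the <td> part is what makes the match indifferent to any class attribute on the cell.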
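Behind the form, the result is just a small XML file. The sketch below shows one way the link and title pairs could be assembled into a bare-bones RSS 2.0 feed with Python's standard library. The function name build_rss, the channel title and the channel link are placeholders of mine, not values taken from Feed43 or the council site.

```python
import xml.etree.ElementTree as ET

def build_rss(channel_title, channel_link, items):
    """Build a minimal RSS 2.0 document from (link, title) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Scraped from " + channel_link
    for link, title in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title  # the stored link text
        ET.SubElement(item, "link").text = link    # the stored address
        ET.SubElement(item, "guid").text = link    # address doubles as unique id
    return ET.tostring(rss, encoding="unicode")

# Placeholder channel details; the item is the decision quoted in this post.
feed_xml = build_rss(
    "Council decisions (unofficial)",
    "http://preston.moderngov.co.uk/",
    [(
        "http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0",
        "North West England Regional Spatial Strategy Partial Review Consultation",
    )],
)
print(feed_xml)
```

Note that the & in the address is written out as &amp; in the XML, which is what feed readers expect.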
Moving beyond the basics
The thing that makes scraping pages difficult is picking through the HTML. Feed43 makes this easier by limiting the number of options to filter by. But if you need to push further in then you will need to explore other options. One to consider is Yahoo Pipes, which has a page grabber option. But you will also need to invest some time in understanding regular expressions.
I think this kind of stuff is more and more important for orgs and journalists, especially when it comes to councils and government orgs. We all know how ‘mundane’ many find this stuff (important as it is). So making it into a feed would be more conducive to newsgathering by stealth. It would encourage more ‘passive aggressive newsgathering’, as Paul Bradshaw once described it.
5 Responses
Egrommet
October 26th, 2009 at 11:23 am
I’ll be looking at that with interest, I’ve played with Dapr and found it a bit flaky and I’m currently trying to get to grips with OpenKapow to build my own rss robots too – but anything for an easier life is greatly welcomed.
Andy
October 26th, 2009 at 3:43 pm
I think the sure fire way is with something like Pipes and some heavy lifting using reg-ex. I’m sure that feed43 is no more or less flaky than Dapper. Just tried setting the same feed up and it was a little easier (nice visual interface) but a few of the other pages I made with feed43 were not as easy.
Disclaimer
Andy would like to point out that the views expressed in this blog are his own and do not reflect the views of the University or Department of Journalism.