Reading and Parsing an Atom Feed

December 8, 2011

Article copied from December 2011 Clippings.

**************************************************************
GURU GUIDANCE
**************************************************************

READING AND PARSING AN ATOM FEED
By Julian Robichaux, nsftools.com

In my previous article, I discussed how to read and parse an RSS feed using Java. In that example, we used standard Java classes to load the feed into an XML parser and extracted the important elements as we walked through the document. In this article, we will use a feed-parsing library to simplify the task.

Atom feeds are becoming much more common than RSS feeds “in the wild”, especially because the Atom feed format is often used to provide REST API functionality for publish/subscribe and document-centric types of operations. The Atom format is also more extensible than RSS, and as such it generally has a lot more elements to sort through when you’re retrieving information.

Luckily, there are some feed parsing libraries that can remove a lot of the complexity and guesswork (and RFC reading) for you.

The Two Kinds of Atom Feeds

There are actually two different kinds of Atom feed formats: The Atom Syndication Format (as described by RFC 4287) and the Atom Publishing Protocol (as described by RFC 5023).

The Syndication Format is more of what most people think of when they talk about “feeds” — like an RSS feed, it is used for listing content like news articles, blog entries, etc. The Publishing Protocol is what is used for REST API applications, where document-type data can be listed, added, edited, and deleted.

In general, with a Syndication feed, you simply read the feed data and you’re done with it; with a Publishing feed, you can drill down into the data, first pulling the high-level Service feed, and then pulling child Collection feeds (sometimes several levels deep) to get your data.

Good descriptions of both feed formats can be found at the AtomEnabled.org Web site: http://www.atomenabled.org/developers/syndication and http://atomenabled.org/developers/protocol

Libraries: ROME, Apache Abdera, Apache Wink

The three major Java libraries I’ve found and attempted to use for Atom feed parsing are ROME ( https://rometools.jira.com/wiki/display/ROME/Home ), Apache Abdera ( http://abdera.apache.org ), and Apache Wink ( http://incubator.apache.org/wink ).

ROME and Apache Abdera both seemed to work well with Notes/Domino 8.5, and both were very easy to use. ROME was smaller in terms of the size of the JAR files and required dependencies, while Abdera had more recent updates.

I was able to make Apache Wink work with XPages, but I consistently got errors trying to use it with script libraries and agents. As a result, I will not show code samples for Wink in this article. However, Niklas Heidloff has a good example of using Wink with XPages on the OpenNTF Web site ( http://www.openntf.org/p/XPages%20For%20Connections and http://www.openntf.org/blogs/openntf.nsf/d6plinks/NHEF-8CCDTQ ).

Including the Libraries in Domino

The first thing you have to do (after choosing a library) is decide how to add the library to your Domino server in such a way that your code can use it. Keep in mind that when I talk about a “library” in this sense, it’s not just a single JAR file. Each of the libraries also has other JAR file dependencies that you will need to include.

For XPages, this normally means adding the library files to the WebContent/WEB-INF location in your database. Basic instructions for adding a JAR file to the WEB-INF folder and calling the code from an XPage can be found at http://www-10.lotus.com/ldd/ddwiki.nsf/dx/reuse_java_xpage.htm

For agents, you can import the library files into a script library and then include the script library with your agents. However, if you have permissions to do so on the server (or your Admin is nice enough to do this for you), often the best thing to do is to have the Domino server load the library files on startup and retain them in memory. This will generally give you the best performance because the server won’t have to reload the JAR files every time they are needed. Your two options for doing this are to add all the JAR files to your Domino server’s jvm/lib/ext directory, or to put them in an arbitrary directory on the server and use the JavaUserClasses or JavaUserClassesExt notes.ini variables to point to them.
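
For example, on a Windows server the notes.ini entry might look something like this, assuming the JAR files have been copied to a hypothetical C:\Domino\atomlibs directory (use the real path and file names for whichever library and dependencies you choose):

JavaUserClasses=C:\Domino\atomlibs\rome-1.0.jar;C:\Domino\atomlibs\jdom.jar

Multiple paths are separated with a semicolon on Windows (a colon on UNIX-style platforms), and the server has to be restarted before the JVM will pick up the new classpath entries.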

Reading and Parsing a Syndication Feed

With that as an intro, on to the examples! Here is sample code for reading and parsing an Atom Syndication feed using both ROME and Apache Abdera:

ROME
import lotus.domino.*;
import java.net.URL;
import java.util.Iterator;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            // load the feed from its URL; ROME handles the HTTP request for us
            URL feedUrl = new URL( "http://www.nsftools.com/blog/blog.xml" );
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(feedUrl));

            // loop through the entries in the feed and print each title
            Iterator entryIter = feed.getEntries().iterator();
            while (entryIter.hasNext()) {
                SyndEntry entry = (SyndEntry) entryIter.next();
                System.out.println(entry.getTitle());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
Abdera
import lotus.domino.*;
import java.net.URL;
import java.util.Iterator;

import org.apache.abdera.Abdera;
import org.apache.abdera.model.Entry;
import org.apache.abdera.model.Feed;
import org.apache.abdera.parser.Parser;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            // open a stream to the feed URL and parse it with Abdera
            URL feedUrl = new URL( "http://www.nsftools.com/blog/blog.xml" );
            Abdera abdera = new Abdera();
            Parser parser = abdera.getParser();
            org.apache.abdera.model.Document doc = parser.parse(feedUrl.openStream());
            Feed feed = (Feed) doc.getRoot();

            // loop through the entries in the feed and print each title
            Iterator entryIter = feed.getEntries().iterator();
            while (entryIter.hasNext()) {
                Entry entry = (Entry) entryIter.next();
                System.out.println(entry.getTitle());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

You can see that the code in both examples is structured in much the same way: you generate a Feed object by passing in a URL, and then you iterate through the Entry objects inside the feed. While the examples only demonstrate getting the title of each entry, there are also methods like getAuthor(), getContent(), and so on.
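
For example, with ROME you could expand the loop from the first example to pull a few more fields out of each entry. This is just a sketch; which of these fields are actually populated depends on what the publisher puts in the feed:

            Iterator entryIter = feed.getEntries().iterator();
            while (entryIter.hasNext()) {
                SyndEntry entry = (SyndEntry) entryIter.next();
                System.out.println("Title:     " + entry.getTitle());
                System.out.println("Link:      " + entry.getLink());
                System.out.println("Author:    " + entry.getAuthor());
                System.out.println("Published: " + entry.getPublishedDate());
                // getDescription() can be null if the feed has no summary for the entry
                if (entry.getDescription() != null) {
                    System.out.println("Summary:   " + entry.getDescription().getValue());
                }
            }

The Abdera Entry class has similar methods, such as getAuthor(), getUpdated(), getSummary(), and getContent().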

Also, in both cases the HTTP transport layer is handled for you, so you don’t have to worry about the process of actually making the connection to the feed server and retrieving the content.

Reading and Parsing a Publishing Feed

Here is some sample code for reading and parsing an Atom Publishing feed using both ROME and Apache Abdera:

ROME
import lotus.domino.*;
import java.util.Iterator;

import com.sun.syndication.propono.atom.client.AtomClientFactory;
import com.sun.syndication.propono.atom.client.ClientAtomService;
import com.sun.syndication.propono.atom.client.NoAuthStrategy;
import com.sun.syndication.propono.atom.common.AtomService;
import com.sun.syndication.propono.atom.common.Collection;
import com.sun.syndication.propono.atom.common.Workspace;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            // the "introspection" URL is the Service document, which is the entry
            // point of the Atom Publishing feed
            String endpoint = "http://quickr.example.com/dm/atom/introspection";

            // NOTE: the ROME Propono Atom Service parser is VERY picky about namespaces,
            // and if all the nodes in the feed aren't specifically designated as
            // being in the "http://www.w3.org/2007/app" namespace then the feed
            // parsing will fail.
            ClientAtomService service = AtomClientFactory.getAtomService(endpoint, new NoAuthStrategy());
            Workspace workspace = (Workspace) service.getWorkspaces().get(0);

            // loop through the Collections in the first Workspace and print each title
            Iterator collIter = workspace.getCollections().iterator();
            while (collIter.hasNext()) {
                Collection coll = (Collection) collIter.next();
                System.out.println(coll.getTitle());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
Abdera
import lotus.domino.*;
import java.net.URL;
import java.util.Iterator;

import org.apache.abdera.Abdera;
import org.apache.abdera.model.Collection;
import org.apache.abdera.model.Service;
import org.apache.abdera.model.Workspace;
import org.apache.abdera.parser.Parser;

public class JavaAgent extends AgentBase {
    public void NotesMain() {
        try {
            Abdera abdera = new Abdera();
            Parser parser = abdera.getParser();
            // the "introspection" URL is the Service document, which is the entry
            // point of the Atom Publishing feed
            URL url = new URL( "http://quickr.example.com/dm/atom/introspection" );

            org.apache.abdera.model.Document doc = parser.parse(url.openStream());
            Service service = (Service) doc.getRoot();
            Workspace workspace = (Workspace) service.getWorkspaces().get(0);

            // loop through the Collections in the first Workspace and print each title
            Iterator collIter = workspace.getCollections().iterator();
            while (collIter.hasNext()) {
                Collection coll = (Collection) collIter.next();
                System.out.println(coll.getTitle());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

As before, the process is essentially the same with both libraries: create a Service object from a URL, get the first Workspace in that feed, and then iterate through the Collection objects in the Workspace. The difference is that with Abdera the parsing classes are built in, while with ROME there is an additional library called “Propono” that has to be used as well. You can just include the Propono JAR in the same location where the other ROME-related JARs are stored.

However, I have found the Abdera parser to be much more “forgiving” of the incoming feed format than the ROME parser. As noted in the code above, if the Atom feed doesn’t explicitly use an XML namespace that ROME Propono is looking for, the parsing will fail. I was able to get around this in testing by doing some minor customizations of the ClientAtomService methods, but in practice it’s often easier just to use Apache Abdera.
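
Once you have the Collections, drilling down one more level is just a matter of reading each Collection’s own feed. Here is a minimal sketch using Abdera, reusing the parser and workspace objects from the Publishing example above (it assumes each Collection href is an absolute URL you are allowed to read anonymously, and it also needs the Feed and Entry imports from the Syndication example):

            // fetch the feed behind each Collection and list its entries
            Iterator collIter = workspace.getCollections().iterator();
            while (collIter.hasNext()) {
                Collection coll = (Collection) collIter.next();
                URL collUrl = new URL( coll.getHref().toString() );

                org.apache.abdera.model.Document collDoc = parser.parse(collUrl.openStream());
                Feed collFeed = (Feed) collDoc.getRoot();

                Iterator entryIter = collFeed.getEntries().iterator();
                while (entryIter.hasNext()) {
                    Entry entry = (Entry) entryIter.next();
                    System.out.println(coll.getTitle() + " -> " + entry.getTitle());
                }
            }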

Authentication

If you have to authenticate with a username and password in order to access a feed, both libraries have support for that as well. Here are code snippets showing how basic authentication works with each library:

ROME
         HttpURLConnection con = (HttpURLConnection) feedUrl.openConnection();
         // Base64-encode the "username:password" pair for the HTTP Basic auth header
         String encoding = new sun.misc.BASE64Encoder().encode( (username + ":" + password).getBytes() );
         con.setRequestProperty("Authorization", "Basic " + encoding);

         SyndFeedInput input = new SyndFeedInput();
         SyndFeed feed = input.build(new XmlReader(con));
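
Note that sun.misc.BASE64Encoder is an internal JDK class, so it isn’t guaranteed to exist on every JVM. As an alternative sketch (assuming a Java 6 or later JVM, where javax.xml.bind is part of the standard library), the same header could be built like this:

         String encoding = javax.xml.bind.DatatypeConverter.printBase64Binary(
                 (username + ":" + password).getBytes() );
         con.setRequestProperty("Authorization", "Basic " + encoding);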
 
Abdera
         // AbderaClient and ClientResponse are in the org.apache.abdera.protocol.client package;
         // UsernamePasswordCredentials comes from the Apache Commons HttpClient classes that
         // the Abdera client uses
         AbderaClient client = new AbderaClient(abdera);
         client.usePreemptiveAuthentication(true);
         client.addCredentials("https://myserver.example.com", null, null,
                 new UsernamePasswordCredentials(username, password));
         AbderaClient.registerTrustManager();

         ClientResponse response = client.get(feedUrl.toString());
         org.apache.abdera.model.Document doc = parser.parse(response.getInputStream());

From there, you can plug back into the code from the Syndication examples earlier in this article.

Performance

I didn’t do very much performance testing to compare the two libraries against each other, so I can’t give an opinion on which one is faster than the other. Unscientifically, they both seemed to have similar performance.

The big performance killer is really just loading the libraries in the first place, because it involves loading several megabytes worth of JAR files into memory before the code can even start running. Script libraries performed poorly, while pre-loading the libraries (either on the classpath or on an XPage) made a huge difference.

Of course, this is also a consideration when using a third-party parsing library like ROME or Abdera, versus retrieving and parsing the feed XML manually like we did in the previous article. Certainly, using fewer libraries and fewer levels of abstraction can sometimes improve performance, especially load-time performance... but that’s the classic tradeoff between ease-of-coding and raw speed. If we were only worried about speed, we’d still be programming in assembler.
