tag:blogger.com,1999:blog-34657167040988638192024-03-13T14:18:26.641-07:00Programming The WebApplying the best qualities from the web's infrastructure.James Leighhttp://www.blogger.com/profile/14146095505004628863noreply@blogger.comBlogger67125tag:blogger.com,1999:blog-3465716704098863819.post-46907268397488016652013-03-12T12:57:00.002-07:002013-03-12T12:57:16.556-07:00Actor Model: Multi-threaded Parallel Processing in Java<h2>
Actor Model: Multi-threaded Parallel Processing in Java</h2>
<h3>
Intent</h3>
The Actor model provides a constrained way to use multi-threaded parallel processing. Each Actor is used to process queued requests (or "messages") one at a time, as one stage of a pipeline. Below is one way to implement the Actor model in Java.<h3 style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-large; font-style: normal; font-variant: normal; font-weight: bold; letter-spacing: normal; line-height: 40px; margin: 0.625rem 0px; orphans: 2; text-align: start; text-indent: 0px; text-rendering: optimizelegibility; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
Motivation</h3>
When multiple tasks need to be performed in a pipeline, there is sometimes a desire to execute them concurrently using separate threads, for example to take advantage of hardware with multiple CPU cores. But since concurrent programming is notoriously prone to bugs that are difficult to replicate and isolate, it is helpful to use a programming model that imposes some structure on the use of separate threads and how they can interact.<br /><br />In the Actor model, each actor runs in its own thread and only operates locally on its own queue of tasks. Multiple actors can be set up in a pipeline to work in parallel, each actor consuming the tasks in its own queue, and potentially adding tasks to the queues of other actors. For example, one actor may save RDF graphs from an HTTP endpoint, while another actor downstream later performs a computation on those graphs.<br /><br />Another reason for using an actor that runs in its own thread, processing one task at a time, is to impose throttling, so that too many threads are not trying to run at once. Throttling is not only helpful in preventing one client from consuming inordinate resources; in many cases it can actually improve total throughput, by preventing resource contention.<h3 style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-large; font-style: normal; font-variant: normal; font-weight: bold; letter-spacing: normal; line-height: 40px; margin: 0.625rem 0px; orphans: 2; text-align: start; text-indent: 0px; text-rendering: optimizelegibility; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<a name="Implementation"></a>Implementation</h3>
Each actor is represented by a separate Java class and has its own queue of similar tasks (or messages) that it will process, one at a time. Each task is represented as an instance of the actor class. A task's constructor creates an instance that can be sent to the actor by calling the "submit" method on that instance, thus queuing the task for processing. The constructors take required parameters as arguments; optional parameters may be set via setter methods.<br /><br />Each actor (a Java class) has its own thread, which it uses to asynchronously process the tasks in its queue. Different actors process different types of tasks in different queues and different threads. A different actor thread is held in a static field called "actor" in each task class. The actor is an instance of a standard class called ExecutorService that is provided by the JVM. The queue is inside the actor and maintained by the actor, so custom code does not need to deal with the queue maintenance, thus (hopefully) reducing the opportunity for thread programming errors.<br /><br />Each task class must implement a "call" method (from the Callable interface), which will be called when it is time to process one instance (or message).<h3 style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-large; font-style: normal; font-variant: normal; font-weight: bold; letter-spacing: normal; line-height: 40px; margin: 0.625rem 0px; orphans: 2; text-align: start; text-indent: 0px; text-rendering: optimizelegibility; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<a name="Sample_Code"></a>Sample Code</h3>
<div style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
We'll sketch out how to define a task class called GraphReaderTask, which will read RDF graphs from a set of URLs and store those graphs into an RDF repository using Sesame; each task instance handles one URL. First, we'll need to import the standard Java concurrency classes, along with the I/O and Sesame classes used below:</div>
<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.UndeclaredThrowableException;
import java.net.URL;
import java.net.URLConnection;
import java.util.concurrent.*;

import org.openrdf.OpenRDFException;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.RDFFormat;</pre>
</blockquote>
<div style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
Our GraphReaderTask class (or a nested class) must implement the Callable interface:</div>
<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">public class GraphReaderTask implements Callable<Void> {</pre>
</blockquote>
<div style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
Here is the static "actor" that holds the thread for the GraphReaderTask:</div>
<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">private static final ExecutorService actor = Executors.newSingleThreadExecutor();</pre>
</blockquote>
<div style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
Next, some fields that each GraphReaderTask instance will need in order to process its message:</div>
<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">private final Repository repository; // a Sesame RDF repository
private final String url; // the URL of an RDF graph to save
private Future<Void> ctrl; // tracks the task once it is submitted</pre>
</blockquote>
<div style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
Now we can define a GraphReaderTask constructor and control methods:</div>
<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">public GraphReaderTask(Repository repository, String url) {
    this.repository = repository;
    this.url = url;
}

public boolean isSubmitted() {
    return ctrl != null;
}

public boolean isCancelled() {
    return ctrl != null && ctrl.isCancelled();
}

public boolean isDone() {
    return ctrl != null && ctrl.isDone();
}

public synchronized void submit() {
    if (ctrl == null) {
        ctrl = actor.submit(this);
    } else {
        throw new IllegalStateException("task already submitted");
    }
}

public synchronized boolean cancel() {
    // guard against cancel() being called before submit()
    return ctrl != null && ctrl.cancel(false);
}

public void await() throws InterruptedException, IOException,
        OpenRDFException {
    try {
        ctrl.get();
    } catch (ExecutionException e) {
        try {
            // rethrow the task's original exception where possible
            throw e.getCause();
        } catch (Error cause) {
            throw cause;
        } catch (RuntimeException cause) {
            throw cause;
        } catch (IOException cause) {
            throw cause;
        } catch (OpenRDFException cause) {
            throw cause;
        } catch (Throwable cause) {
            throw new UndeclaredThrowableException(cause);
        }
    }
}</pre>
</blockquote>
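The try/catch ladder in await() exists because ExecutorService reports task failures indirectly: whatever the task's call method throws arrives at the caller wrapped as the cause of an ExecutionException. Here is a minimal, stdlib-only sketch of that behavior (the UnwrapDemo class and the simulated IOException are illustrative stand-ins, not part of the GraphReaderTask code):

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class UnwrapDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService actor = Executors.newSingleThreadExecutor();
        // Submit a task that fails, much as a GraphReaderTask might on a bad URL
        Future<Void> ctrl = actor.submit(new Callable<Void>() {
            public Void call() throws IOException {
                throw new IOException("simulated download failure");
            }
        });
        try {
            ctrl.get();
        } catch (ExecutionException e) {
            // getCause() recovers the original exception thrown inside call()
            System.out.println(e.getCause().getMessage());
        } finally {
            actor.shutdown();
        }
    }
}
```

Running this prints "simulated download failure" from the original IOException, not the wrapper, which is exactly what await() rethrows to its caller.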
Now we can define a "call" method, which will be invoked by the actor when it is time to process the next task from this actor's queue, and must perform the guts of whatever this actor/task should do. In this example, the GraphReaderTask simply reads an RDF graph from a URL and stores it into our Sesame repository.<br />
<br />Remember that all instances of GraphReaderTask are associated with one actor, which is an ExecutorService providing features for shutting down gracefully. So before we actually start doing any work, we first check whether the task has been cancelled, and if so, merely return without doing anything (except perhaps writing a note to a log).<br />
<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">public Void call() throws IOException, OpenRDFException {
    if (isCancelled())
        return null; // skip the work entirely
    URLConnection http = new URL(url).openConnection();
    http.setRequestProperty("Accept", "application/rdf+xml");
    InputStream in = http.getInputStream();
    RepositoryConnection con = repository.getConnection();
    con.setAutoCommit(false); // begin a transaction
    try {
        ValueFactory vf = con.getValueFactory();
        URI graph = vf.createURI(url);
        con.clear(graph); // drop any previous copy of this graph
        if (isCancelled())
            return null;
        con.add(in, url, RDFFormat.RDFXML, graph);
        con.setAutoCommit(true); // commit the transaction
    } finally {
        con.rollback(); // discards any uncommitted changes
        con.close();
        in.close();
    }
    return null;
}</pre>
</blockquote>
Now that we have defined the GraphReaderTask class, we can put it to use:<blockquote class="tr_bq">
<pre style="-webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; display: block; font-family: Monaco, Menlo, Consolas, 'Courier New', monospace; font-size: 0.8333333333333334rem; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 15px; margin: 0px 0px 0.625rem; orphans: 2; padding: 0.5952380952380952rem; text-align: start; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-break: break-all; word-spacing: 0px; word-wrap: break-word;">new GraphReaderTask(repository, url).submit();</pre>
</blockquote>
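To see the whole pattern in one place, here is a self-contained variant with the Sesame-specific work replaced by a trivial payload. The EchoTask class, its message field, and the daemon thread factory are illustrative stand-ins under my own naming, not part of the article's code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

public class EchoTask implements Callable<String> {
    // One actor (and one worker thread) shared by all EchoTask instances.
    // A daemon thread lets the JVM exit without an explicit shutdown().
    private static final ExecutorService actor =
            Executors.newSingleThreadExecutor(new ThreadFactory() {
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "echo-actor");
                    t.setDaemon(true);
                    return t;
                }
            });

    private final String message; // required parameter, set by the constructor
    private Future<String> ctrl;  // tracks the task once it is submitted

    public EchoTask(String message) {
        this.message = message;
    }

    public synchronized void submit() {
        if (ctrl != null)
            throw new IllegalStateException("already submitted");
        ctrl = actor.submit(this); // queue this task for the actor thread
    }

    public String await() throws InterruptedException, ExecutionException {
        return ctrl.get(); // block until the actor has processed this task
    }

    public String call() {
        // Runs on the actor's single thread, one queued task at a time
        return Thread.currentThread().getName() + " echoed: " + message;
    }

    public static void main(String[] args) throws Exception {
        EchoTask task = new EchoTask("hello");
        task.submit(); // the caller thread is now free to do other work
        System.out.println(task.await());
    }
}
```

As with GraphReaderTask, the caller constructs an instance, submits it, and may later await() the result; all EchoTask instances share the single actor thread, so their call() bodies never run concurrently.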
This technique allows the caller thread to continue with other processing, while the dedicated graph reader thread takes care of parsing RDF. By using a queue we ensure that threads are not blocked when they could be performing other operations. The await() method can be used by the caller to re-join when the task is complete and propagate any exceptions that may have occurred.James Leighhttp://www.blogger.com/profile/14146095505004628863noreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-10830395029186837562012-10-03T08:33:00.002-07:002012-11-08T11:44:49.827-08:00Provenance and Traceability in RDF with Callimachus<h2>
Provenance and Traceability in RDF with Callimachus</h2>
All too often software is designed without adequate regard for traceability. Traceability refers to the ability to audit the state of data at any point in the system for correctness and completeness; for any entity in the system, all transactions that led to the current state, and their metadata, can be examined, reviewed, and verified. Software is supposed to be designed according to the stakeholders' requirements, but many of these experts take traceability for granted. Most people don't audit most of the time, but the ability to audit at all requires traceability all the time.<br />
Consider the common scenario where a business is trying to bring some semi-automation to a business process. Often businesses are trying to move from an informal email-based process to a web-based semi-automated process. Such a move can reduce human involvement and make the process faster and more efficient, leading to greater productivity. However, few participants realize the inherent traceability of email-based processes. Moving from email-based to web-based processes, without proper consideration, can kill a company's ability to audit the process for correctness and completeness.<br />
Today most web-based systems are built using SQL databases. However, the rigid nature of SQL-based systems creates a significant barrier for adding traceability to an existing SQL-based system. Traceability is not an add-on feature; it requires deep integration into every change and every transaction. This is something many SQL-based systems cannot easily provide.<br />
Papers on digital traceability date as far back as 1986. A quarter of a century later, however, there are still no standards for tracking digital conceptual objects (as there are in many other industries for the traceability of physical objects). Furthermore, following the explosion of digital data in the past decade and the increased reliance on information from the Web, there is a growing concern that no one seems to know whether any of the information being collected is accurate.<br />
This may change in 2013, as the W3C has been working on a general <a href="http://www.w3.org/2011/prov/wiki/Main_Page">provenance information standard</a> since 2009 that is scheduled for release next year. It is intended to support the widespread publication and use of provenance information for Web documents, data, and resources. Specifically, the working group is defining a provenance interchange language and methods to publish and access provenance metadata using this language.<br />
The <a href="http://www.w3.org/TR/prov-primer/">PROV specification</a> (currently in last call) describes things in terms of entities, activities, and agents. Entities are physical, digital, conceptual, or any other kind of thing; examples include a web page, a chart, and a spellchecker. Activities are how entities come into existence and how their attributes change. An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or any other entity that may be ascribed responsibility.<br />
<a href="http://callimachusproject.org/">Callimachus</a> 0.18 will be the first Callimachus release to use this new PROV language to seamlessly describe all the activities that take place in the system. The Callimachus project was named after the man who created the first library catalogue, so it should not be too surprising that the project continues this legacy by creating metadata about every activity performed.<br />
When a new resource is created, metadata is stored in the triple store to record the event. These activities are stored in the RDF store in a named graph. For example, a create form might submit the following triples:<br />
<pre class="code"></sun> a </callimachus/Concept> , skos:Concept ;
skos:prefLabel "Sun" ;
skos:definition "The great luminary" .</pre>
Additional authorization information is copied from the class and parent folder that includes:<br />
<pre class="code"></sun> calli:reader </group/public> ;
calli:subscriber </group/everyone> ;
calli:editor </group/staff> ;
calli:administrator </group/admin> .</pre>
Callimachus uses this authorization information as a simple authorization model, similar to the ACL of a file system. Groups or users of the system are assigned authorization rights to the resource: calli:reader provides read-only access; calli:subscriber provides access to the resource's history and provenance data and grants the ability to discuss or comment on the resource; calli:editor provides the ability to change the resource; and calli:administrator provides the ability to change the authorization information.<br />
The resource is also inserted into the parent folder using the following triple:<br />
<pre class="code"></> calli:hasComponent </sun> .</pre>
Callimachus provides a hierarchical view of resources that mimics the path segments of their identifier. This hierarchical relationship is captured using the inverse-functional calli:hasComponent property from the parent resource to its child. The reason this is an inverse-functional relationship is to require proper authorization to change the parent resource when adding a new child resource.<br />
Finally, all these triples are combined and stored in the RDF store in an activity graph, along with the PROV metadata of the activity itself. prov:wasGeneratedBy is a functional property that links each resource entity to the last activity that modified it. The prov:generated/prov:specializationOf path links the activity to the resource entities it modified.<br />
<pre class="code">GRAPH </activity/2012/11/08/t1> {
</activity/2012/11/08/t1> a </callimachus/Activity>, audit:RecentBundle ;
calli:reader </group/everyone>, </group/staff>, </group/admin> ;
prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
prov:wasInfluencedBy </activity/2012/11/08/t0> .
</activity/2012/11/08/t1#provenance> a prov:Activity ;
prov:startedAtTime "2012-11-08T15:07:22.869Z"^^xsd:dateTime ;
prov:wasAssociatedWith </user/james> ;
prov:generated </activity/2012/11/08/t1#!/sun> ;
prov:generated </activity/2012/11/08/t1#!/> ;
prov:generated </activity/2012/11/08/t1#!/activity/2012/11/08/> ;
prov:endedAtTime "2012-11-08T15:07:24.583Z"^^xsd:dateTime .
</activity/2012/11/08/t1#!/sun>
prov:specializationOf </sun> .
</sun> a </callimachus/Concept>, skos:Concept ;
calli:administrator </group/admin> ;
calli:editor </group/staff> ;
calli:reader </group/public> ;
calli:subscriber </group/everyone> ;
prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
skos:definition "The great luminary" ;
skos:prefLabel "Sun" .
</activity/2012/11/08/t1#!/>
prov:specializationOf </> ;
prov:wasRevisionOf </activity/2012/11/08/t0#!/> .
</>
calli:hasComponent </sun> ;
prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
</activity/2012/11/08/t1#!/activity/2012/11/08/>
prov:specializationOf </activity/2012/11/08/> ;
prov:wasRevisionOf </activity/2012/11/08/t0#!/activity/2012/11/08/> .
</activity/2012/11/08/>
calli:hasComponent </activity/2012/11/08/t1> ;
prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
}</pre>
Modifying a resource is a bit trickier, as Callimachus stores both the previous version and the new version of the resource. Suppose the client sends the following update to the server:<br />
<pre class="code">DELETE DATA {
</sun> skos:definition "The great luminary" .
};
INSERT DATA {
</sun> skos:definition "The lamp of day" .
};</pre>
Three triples are removed (not just one) from all graphs in the RDF store.<br />
<pre class="code">DELETE DATA {
</sun> skos:definition "The great luminary" ;
prov:wasGeneratedBy </activity/2012/10/02/t1> .
</activity/2012/10/02/> prov:wasGeneratedBy </activity/2012/10/02/t1> .
};</pre>
The triple is then replaced with the following, to keep the semantics of the first activity intact.<br />
<pre class="code">INSERT DATA {
GRAPH </activity/2012/11/08/t1> {
</activity/2012/11/08/t1#!/sun> audit:with </activity/2012/11/08/t2#5eef4c8f> .
}
}</pre>
In addition, a new named graph is created with the following, to represent this new activity.<br />
<pre class="code">GRAPH </activity/2012/11/08/t2> {
</activity/2012/11/08/t2> a </callimachus/Activity> , audit:RecentBundle ;
calli:reader </group/everyone>, </group/staff>, </group/admin>;
prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
prov:wasInfluencedBy </activity/2012/11/08/t1> .
</activity/2012/11/08/t2#provenance> a prov:Activity ;
prov:startedAtTime "2012-11-08T15:19:31.199Z"^^xsd:dateTime ;
prov:wasAssociatedWith </user/james> ;
prov:generated </activity/2012/11/08/t2#!/sun> ;
prov:generated </activity/2012/11/08/t2#!/activity/2012/11/08/> ;
prov:endedAtTime "2012-11-08T15:19:31.295Z"^^xsd:dateTime .
</activity/2012/11/08/t2#!/sun>
audit:without </activity/2012/11/08/t2#5eef4c8f> ;
prov:specializationOf </sun> ;
prov:wasRevisionOf </activity/2012/11/08/t1#!/sun> .
</activity/2012/11/08/t2#5eef4c8f>
rdf:object "The great luminary" ;
rdf:predicate skos:definition ;
rdf:subject </sun> .
</sun>
prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
skos:definition "The lamp of day" .
</activity/2012/11/08/t2#!/activity/2012/11/08/>
prov:specializationOf </activity/2012/11/08/> ;
prov:wasRevisionOf </activity/2012/11/08/t1#!/activity/2012/11/08/> .
</activity/2012/11/08/>
calli:hasComponent </activity/2012/11/08/t2> ;
prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> .
}</pre>
Callimachus also allows users to upload RDF triple files (RDF/XML and Turtle). When an entire RDF file is uploaded, the metadata stored is slightly different. If the file data.rdf is uploaded to the home folder, all the triples in the file are inserted into the named graph </data.rdf>. In addition, the following named graph is created, and the binary file is stored permanently on disk, associated with the same activity identifier.<br />
<pre class="code">GRAPH </activity/2012/11/08/t3> {
</activity/2012/11/08/t3> a </callimachus/Activity>, audit:RecentBundle ;
calli:reader </group/everyone>, </group/staff>, </group/admin> ;
prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
prov:wasInfluencedBy </activity/2012/11/08/t1> ;
prov:wasInfluencedBy </activity/2012/11/08/t2> .
</activity/2012/11/08/t3#provenance> a prov:Activity ;
prov:startedAtTime "2012-11-08T15:36:40.039Z"^^xsd:dateTime ;
prov:wasAssociatedWith </user/james> ;
prov:generated </activity/2012/11/08/t3#!/data.rdf> ;
prov:generated </activity/2012/11/08/t3#!/> ;
prov:generated </activity/2012/11/08/t3#!/activity/2012/11/08/> ;
prov:endedAtTime "2012-11-08T15:36:40.951Z"^^xsd:dateTime .
</activity/2012/11/08/t3#!/data.rdf>
prov:specializationOf </data.rdf> .
</data.rdf> a </callimachus/NamedGraph>, sd:NamedGraph, foaf:Document ;
calli:administrator </group/admin> ;
calli:editor </group/staff> ;
calli:reader </group/public> ;
calli:subscriber </group/everyone> ;
prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
dcterms:identifier "data" .
</activity/2012/11/08/t3#!/>
prov:specializationOf </> ;
prov:wasRevisionOf </activity/2012/11/08/t2#!/> .
</>
calli:hasComponent </data.rdf> ;
prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> .
</activity/2012/11/08/t3#!/activity/2012/11/08/>
prov:specializationOf </activity/2012/11/08/> ;
prov:wasRevisionOf </activity/2012/11/08/t2#!/activity/2012/11/08/> .
</activity/2012/11/08/>
calli:hasComponent </activity/2012/11/08/t3> ;
prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> .
}</pre>
All of this metadata is readily available in the history tab, or by following the rel=version-history link in the page, the Atom feed, or the Link header in an OPTIONS response. The metadata is formatted as an HTML list or as an Atom feed; both of these representations include links to the PROV activity that modified the resource.<br />
Together, these named metadata activity graphs provide a transparent audit trail of all the entities in the system, linked by common PROV relationships. This allows software developers and stakeholders to focus on their value-added features.<br />
More information about Callimachus can be found at the project page at <a href="http://callimachusproject.org/">http://callimachusproject.org/</a>.<br />
<br />
References for the post include:<br />
<i>Tim Berners-Lee, W3C Chair, <a class="external text" href="http://www.w3.org/DesignIssues/UI.html" rel="nofollow" title="http://www.w3.org/DesignIssues/UI.html">Web Design Issues</a>, September 1997</i><br />
<i>John Sheridan, UK National Archives, <a class="external text" href="http://data.gov.uk/" rel="nofollow" title="http://data.gov.uk">data.gov.uk</a>, February 2010</i><br />
<i>Jill Mesirov, Chief Informatics Officer of the MIT/Harvard Broad Institute, in <a class="external text" href="http://www.sciencemag.org/cgi/content/short/327/5964/415" rel="nofollow" title="http://www.sciencemag.org/cgi/content/short/327/5964/415">Science</a>, January 2010</i><br />
<i>Luc Moreau, University of Southampton, in <a class="external text" href="http://eprints.ecs.soton.ac.uk/18176/" rel="nofollow" title="http://eprints.ecs.soton.ac.uk/18176/">The Foundations of Provenance on the Web</a>, November, 2009</i><br />
<i>Vinton Cerf, Internet pioneer, in <a class="external text" href="http://www.smithsonianmag.com/specialsections/40th-anniversary/Vinton-Cerf-on-Where-the-Internet-Will-Take-Us.html?c=y&page=2" rel="nofollow" title="http://www.smithsonianmag.com/specialsections/40th-anniversary/Vinton-Cerf-on-Where-the-Internet-Will-Take-Us.html?c=y&page=2">Smithsonian's "40 Things you need to know about the next 40 years" issue</a>, July, 2010</i><br />
<i>Jeff Jarvis, media company consultant and associate professor at the City University of New York's Graduate School of Journalism, in <a class="external text" href="http://www.buzzmachine.com/2010/06/27/the-importance-of-provenance/" rel="nofollow" title="http://www.buzzmachine.com/2010/06/27/the-importance-of-provenance/">The importance of provenance</a> on his <a class="external text" href="http://www.buzzmachine.com/" rel="nofollow" title="http://www.buzzmachine.com">BuzzMachine blog</a>, June, 2010</i>James Leighhttp://www.blogger.com/profile/14146095505004628863noreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-48474786969195313572012-06-05T18:39:00.000-07:002012-06-07T05:51:40.009-07:00Running less.js on the JVM Server<div style="font-family: inherit;">
</div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;"><a href="http://lesscss.org/">less.js</a> is a CSS templating language implemented in JavaScript; its script converts templates into a CSS file.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;">The less.js
distribution includes a <a href="http://www.mozilla.org/rhino/">Rhino</a> patch to run less.js from the command
line using Rhino. less.js no longer produces a Rhino
version, but the patch remains available in the master branch.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;">less.js 1.3.0 uses
<a href="http://en.wikipedia.org/wiki/Ecmascript">ECMA</a>-5 and will attempt to upgrade the Object and Array prototypes if
run in a non-ECMA-5 environment. This prevents the script from running in
many of the ECMAScript engines that ship with jdk6.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;">There are two
popular jars available that provide a Java API for less.js. Both of
them use Rhino to run the script in the JVM.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;"><a href="http://www.asual.com/lesscss/">Asual</a>'s has been
around longer and hacks the Rhino patch to run as a library. Asual
requires the latest version of Rhino.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;"><a href="https://github.com/marceloverdijk/lesscss-java">lesscss-java</a>
claims to be the official Java version and includes envjs (which mimics a
browser's scripting environment for running HTML apps offline). This
allows the library to run less.js just as it would run in the
browser. Envjs requires the latest version of Rhino.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;">If you try to run
less.js using the ECMAScript engine in jdk6, you may find that the core
object prototypes are sealed and cannot be extended.</span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;">The version of
ECMAScript on Mac JVMs seems to be only ECMA-3.1 (JavaScript 1.5). To
run less.js you have to <a href="https://github.com/cloudhead/less.js/pull/822">patch it to use utility functions</a> instead of
ECMA-5 functions. less.js also requires window and document objects to function. However, you can get away with the following minimal environment.</span><br />
<span style="font-size: small;"><br /> var window = {};<br /> var location = {port:0};<br /> var document = {<br /> getElementsByTagName: function(){return []},<br /> getElementById: function(){return null}<br /> };<br /> var require = function(arg) {<br /> return window.less[arg.split('/')[1]];<br /> }; </span></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<br />
less.js uses XMLHttpRequest to import referenced documents. If you want to load other files yourself, it is best to override the window.less.Parser.importer function.<br />
<br />
The function takes (path, paths, callback, env), where path is the import URL, paths is an array (passed in from the constructor options), callback is a function to send the results, and env is the constructor options. The callback takes (e, root, content), where e is a thrown error, root is the parse tree, and content is the file's contents (for error reporting). Here is a skeleton of the code you would need to run on jdk6.<br />
var contents = {};<br /> window.less.Parser.importer = function(path, paths, callback, env) {<br /> if (path != null) {<br /> var uri = new java.net.URI(paths[0]).resolve(path).normalize();<br /> var content = ... // TODO read the uri content as a string<br /> var dir = uri.resolve(".").normalize();<br /> var file = dir.relativize(uri).toASCIIString();<br /> contents[file] = content;<br /> var parser = new window.less.Parser({<br /> optimization: 3,<br /> filename: file,<br /> opaque: true,<br /> paths: [dir.toASCIIString()]<br /> });<br /> parser.imports.contents = contents;<br /> parser.parse(content, function (e, root) {<br /> if (e) throw e;<br /> callback(e, root, content);<br /> });<br /> }<br /> };<br />
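The java.net.URI arithmetic in the importer above (resolve, normalize, relativize) can be illustrated outside the JVM with the WHATWG URL API. This is just a sketch; the base and import paths are made-up examples:

```javascript
// Sketch of the importer's path handling, using the WHATWG URL API
// in place of java.net.URI. The base and import paths are made up.
function resolveImport(base, path) {
  var uri = new URL(path, base);              // resolve + normalize
  var dir = new URL(".", uri);                // directory of the resolved URI
  var file = uri.href.slice(dir.href.length); // relativize to get the filename
  return { uri: uri.href, dir: dir.href, file: file };
}

var r = resolveImport("http://example.com/css/site/main.less", "../mixins.less");
console.log(r.uri);  // http://example.com/css/mixins.less
console.log(r.dir);  // http://example.com/css/
console.log(r.file); // mixins.less
```

The file value is what ends up as the parser's filename option and as the key into imports.contents in the skeleton above.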
<br />
<span style="font-size: small;">To help debug less.js errors, the above includes a fix for <a href="https://github.com/cloudhead/less.js/issues/592">issue 592</a>. Every new window.less.Parser has an imports.contents map, and this map needs to contain the basename of any imported file to resolve error locations. If the map does not contain the basename, a charAt error is thrown.</span><br />
<br /></div>
<div class="western" lang="en-US" style="font-family: inherit; font-weight: normal; margin-bottom: 0cm;">
<span style="font-size: small;">If running server
side, you may also be interested in this <a href="https://github.com/cloudhead/less.js/pull/823">patch to inline both</a> less and
CSS files. The opaque flag above turns this on.</span></div>James Leighhttp://www.blogger.com/profile/14146095505004628863noreply@blogger.com1tag:blogger.com,1999:blog-3465716704098863819.post-1514783377908383002012-01-15T11:39:00.000-08:002012-01-15T11:49:11.426-08:00Blob StoreIn release 2.0-beta14 (I know, this is the late beta release) AliBaba introduced a new BLOB store. The BLOB store integrates with the RDF repository ObjectRepository to synchronize transactions. This allows both the BLOB store and the RDF store to be isolated and always consistent with one another. This is done using two-phase commit transactions in the BLOB store.<div><br /></div><div>The BLOB store also has a few other advantages over a traditional file system. First, every change is isolated until it is closed/committed. This prevents other readers from seeing an incomplete BLOB and helps prevent inconsistency between the BLOB and RDF stores. In addition, as disk space is generally considered cheap, all past versions of BLOBs are kept on disk by default. This allows any previous version to be retrieved (and restored) using the API.</div><div><br /></div><div>The BLOB store API is fairly simple. 
Here is what some code might look like using the BLOB store.</div><div><div><br /></div><div> BlobStoreFactory factory = BlobStoreFactory.newInstance();</div><div> BlobStore store = factory.openBlobStore(new File("."));</div><div> </div><div> String key = "http://example.com/store1/key1";</div><div> BlobObject blob = store.open(key);</div><div> </div><div> OutputStream out = blob.openOutputStream();</div><div> try {</div><div> // write stream to out</div><div> } finally {</div><div> out.close();</div><div> }</div><div> </div><div> InputStream in = blob.openInputStream();</div><div> try {</div><div> // read stream from in</div><div> } finally {</div><div> in.close();</div><div> }</div></div><div><br /></div><div>More API options can be seen in the JavaDocs:</div><div><a href="http://www.openrdf.org/doc/alibaba/2.0-beta14/apidocs/org/openrdf/store/blob/package-summary.html">http://www.openrdf.org/doc/alibaba/2.0-beta14/apidocs/org/openrdf/store/blob/package-summary.html</a></div>James Leighhttp://www.blogger.com/profile/14146095505004628863noreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-69185300559081661662011-06-02T10:51:00.000-07:002011-06-02T10:53:44.306-07:00Web Developer Review of BlackBerry PlayBookMost reviews for the PlayBook focus on the same issue: very few downloadable apps in App World. As a web developer - I couldn't care less.<br />
<br />
<b>First Impression</b><br />
<br />
Websites render fast and, due to the high dpi, look really nice. With its compact form, it fits well in my hands, is easy to type on, and is very portable. With a flash plugin included, streaming video is smooth and full screen works. Videos look really slick when plugged into an HD TV. Each app can only open one window, so the browser supports tabs and allows you to keep multiple tabs open at once.<br />
<br />
<b>Honeymoon Ends</b><br />
<br />
Tabbed browsing works on the desktop, but not on the PlayBook. Only the open tab can be actively loading. Opening a new tab before the page loads can stop the page from loading. Opening a new tab while watching video pauses the video. This makes watching commercials really frustrating because you can't turn away or it will pause. Watching videos in the browser is also frustrating, as after five minutes the PlayBook goes into suspend. (There are some tricks to stop this, but not in full-screen mode.) <br />
<br />
In addition, despite all the fuss about multitasking, the PlayBook can't multitask. Most notably, you can only have one web page active at a time, and this includes webapps. <br />
<br />
Surprisingly, the PlayBook is much less web developer friendly than I expected. The script engine is incomplete. There is no offline support for webapps. There is no support for turning a web application into a chromeless app. WebWorks development requires a series of confusing bat commands that don't work the first time. All of this makes it really hard to develop for the PlayBook.<br />
<br />
<b>What's Left</b><br />
<br />
The apps I use include Browser, Wi-Fi Sharing, Word To Go, Slides To Go, Videos, Pictures, aVNC, and ReelPortal. All of them work, but I expected more from almost every one of them.<br />
<br />
All that being said, I am going to hold on and put up with the current limitations of the PlayBook. I really like having a portable web browser, and I believe there is still a lot of potential for this device. I am looking forward to seeing what the next software update has to offer.Anonymousnoreply@blogger.com3tag:blogger.com,1999:blog-3465716704098863819.post-4681514926258207492011-02-14T06:39:00.000-08:002011-02-14T07:42:56.467-08:00Five Steps to a More Secure Web AppThere are a number of different authentication methods available to choose from when launching (or updating) a Web application. Choosing the wrong method can leave the system (or worse, the users) vulnerable to cyber attacks or <a href="http://en.wikipedia.org/wiki/Identity_theft">identity theft</a>.<br /><br />Below are five rules that should always be obeyed (regardless of the method). By considering these rules and how your users will use your system, you can better understand the security requirements of your Web application and can choose the right method.<br /><br /><span style="font-weight:bold;">1) Never send clear user passwords over an unencrypted channel.<br /></span><br />When passwords are sent over an unencrypted channel, anyone who has access to the network (and a little know-how) can read them. This should never be done with user-supplied passwords (not even for intranet websites). Users often use the same password for multiple systems. Exposing a user's password in one system puts them at risk in another.<br /><br />Both basic authentication and form-based authentication are vulnerable to this and should never be used when users can choose their own passwords. 
Digest authentication and encrypted logins do not send clear passwords, and can be used when users can choose their own passwords.<br /><br /><a href="http://en.wikipedia.org/wiki/Basic_access_authentication">HTTP basic authentication</a> and <a href="http://en.wikipedia.org/wiki/HTTP+HTML_form-based_authentication">HTML form-based logins</a> can be used in secure networks to restrict Web access as long as the passwords are pseudo random, unpredictable, and unique across other systems.<br /><br />For systems that allow user created passwords, care must be taken to ensure the passwords are not readable by others by using <a href="http://en.wikipedia.org/wiki/HTTPS">HTTPS</a> or <a href="http://en.wikipedia.org/wiki/Digest_access_authentication">digest</a> during logins.<br /><br /><span style="font-weight:bold;">2) Never send session tokens unencrypted over a shared network.<br /></span><br />Unencrypted <a href="http://en.wikipedia.org/wiki/Session_token#HTTP_session_token">session tokens</a> are visible to anyone who has access to the network. Although session tokens don't expose the user's password, they do allow <a href="http://en.wikipedia.org/wiki/HTTP_cookie#Cookie_theft_and_session_hijacking">hijacking</a> accounts with unlimited access. This should never be used over a public wifi network (or other shared network) to access private information or make changes.<br /><br /><a href="http://en.wikipedia.org/wiki/HTTP_cookie">Cookie</a> based authentication over HTTP is vulnerable to this. <a href="http://en.wikipedia.org/wiki/Digest_access_authentication">Digest authentication</a> and <a href="http://en.wikipedia.org/wiki/HTTPS">HTTPS</a> sessions are not vulnerable.<br /><br />Digest authentication uses a unique <a href="http://en.wikipedia.org/wiki/Salt_(cryptography)">"salt"</a> for every request and digest systems prevent the same "salt" being used more than once (although this is optional). 
By never using and never allowing the same authentication token twice, digest authentication prevents account hijacking.<br /><br />HTTPS requests are encrypted and prevent <a href="http://en.wikipedia.org/wiki/Eavesdropping">eavesdropping</a> from others on the network, preventing access to any request tokens that might be present.<br /><br />HTTPS using keys from a <a href="http://en.wikipedia.org/wiki/X.509#Certificate_authority">certificate authority</a>, HTTPS with <a href="http://en.wikipedia.org/wiki/Self-signed_certificate">self-signed keys</a>, HTTPS with mixed content, or digest authentication should be used to exchange private information over shared networks.<br /><br />For more information about the vulnerabilities of using session tokens see <a href="http://www.cgisecurity.com/2010/01/weaning-the-web-off-of-session-cookies-making-digest-authentication-viable.html">Weaning the Web Off of Session Cookies</a>.<br /><br /><span style="font-weight:bold;">3) Always verify information sent over an insecure network.<br /></span><br />Insecure networks may be vulnerable to malicious attacks such as <a href="http://en.wikipedia.org/wiki/DNS_cache_poisoning">DNS poisoning</a> or a <a href="http://en.wikipedia.org/wiki/Reverse_proxy">trojan Web proxy</a>. These attacks are often called <a href="http://en.wikipedia.org/wiki/Man-in-the-middle_attack">man-in-the-middle</a> attacks and can manipulate the content from the server before it reaches the client (and vice-versa).<br /><br />Most unencrypted HTTP communication is vulnerable to this. 
Even mixed content of both HTTPS and HTTP is vulnerable to man-in-the-middle attacks because compromised HTTP content can read and manipulate HTTPS content.<br /><br />Although digest authentication includes an optional integrity check to prevent this, most browsers either don't check or don't indicate to the user if the content has been verified.<br /><br />All Web browsers verify HTTPS content (when not mixed) and this should be used for insecure networks. For mobile devices that often connect from potentially insecure networks, HTTPS (self-signed or CA-signed) should be enabled by default for any private information.<br /><br /><span style="font-weight:bold;">4) Never give confidential information without verifying authenticity of the server.<br /></span><br />Well <a href="http://en.wikipedia.org/wiki/Domain_Name_System#Security_issues">disguised URLs</a> and familiar-looking pages can trick users into visiting and pseudo-logging into illegitimate websites. If your website asks your users for confidential information, ensure there is a clear way for your users to verify the authenticity of the site before logging in. Otherwise, your users might give confidential information to untrustworthy third parties without even knowing it.<br /><br />HTTPS using previously distributed keys (such as keys from an established <a href="http://en.wikipedia.org/wiki/X.509#Certificate_authority">certificate authority</a>) allows the user to verify the organization in their browser (near the address bar). This allows the user to quickly verify authenticity of the server.<br /><br />HTTPS with self-signed certificates cannot be used to verify authenticity unless they have been previously distributed through a secure channel.<br /><br />Although digest authentication can include <a href="http://tools.ietf.org/html/rfc2617#section-3.2.3">authentication-info</a> to verify authenticity, most browsers either ignore it or don't indicate to the user when the site is verified. 
However, most browsers do show the host name and realm to the user for review before logging in, and this does give the user a chance to check the domain name before logging in.<br /><br />Always use HTTPS for confidential or sensitive information.<br /><br /><span style="font-weight:bold;">5) Never access sensitive information over an unencrypted channel.<br /></span><br />HTTP traffic can be viewed by anyone who has access to the network. It is vital that sensitive information is never sent over unencrypted HTTP.<br /><br />Only HTTPS with known certificates should be used to exchange sensitive information with users.<br /><br /><span style="font-weight:bold;">In summary<br /></span><br />By obeying these five rules you can pick the right authentication method and prevent your system and users from being vulnerable to cyber attacks and identity theft.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-3648954892960877402010-11-28T16:37:00.000-08:002010-11-28T16:41:30.208-08:00Status Code 200 vs 303The public LOD mailing list has been dominated by discussions on using 303 in response to a GET request for distinguishing between the requested resource identifier and a description document identifier.<br /><br />Some resources can be represented completely on the Web. For these resources, any of their URLs can be used to identify them. This blog page, for example, can be identified by the URL in a browser's address bar. However, some resources cannot be completely viewed on the Web - they can only be described on the Web.<br /><br />The <a href="http://www.w3.org/2001/tag/issues.html#httpRange-14">W3C recommends</a> responding with a 200 status code for GET requests of a URL that identifies a resource which can be completely represented on the Web (an information resource). 
They also recommend responding with a 303 for GET requests of a URL that identifies a resource that cannot be completely represented on the Web.<br /><br />Popular Web servers today don't have much support for resources that can't be represented on the Web. This creates a problem for deploying (non-document) resource servers as it can be very difficult to set up resources for 303 responses. The public LOD mailing list has been discussing an alternative of using the more common 200 response for any resource.<br /><br />The problem with always responding to a GET request with a 200 is the risk of using the same URL to identify both a resource and a document describing it. This breaks a fundamental <a href="http://www.w3.org/TR/webarch/#id-resources">Web constraint</a> that says URIs identify a single resource, and causes URI collisions.<br /><br />It is impossible to be completely free of all ambiguity when it comes to URI allocation. However, any ambiguity can impose a cost in communication due to the effort required to resolve it. Therefore, within reason, we should strive to avoid it. This is particularly true for Web recommendation standards.<br /><br />URI collision is perhaps the most common ambiguity in URI allocation. Consider a URL that refers to the movie The Sting and also identifies a description document about the movie. This collision creates confusion about what the URL identifies. If one wanted to talk about the creator of the resource identified by the URL, it would be unclear whether this meant "the creator of the movie" or "the editor of the description." 
Such ambiguity can be avoided using a 303 for a movie URL to redirect to a 200 of the description URL.<br /><br />As <a href="http://lists.w3.org/Archives/Public/public-lod/2010Nov/0554.html">Tim Berners-Lee points out in an email</a>, even including a Content-Location in a 200 response (to indicate a description of the requested resource) "leaves the web not working", because such techniques are already used to associate different representations (and different URLs) to the same resource, and not the other way around.<br /><br />Using any other 200 status code for representations that merely describe a resource (and don't completely represent it) causes ambiguity because Web browsers today interpret all 200-series responses (from a GET request) as containing a complete representation of the resource identified in the request URL.<br /><br />Every day, people bookmark and send links of documents they are viewing in a Web browser. It is essential that any document viewed in a Web browser has a URL identifier in the browser's address bar. Web browsers today don't look at the Content-Location header to get the document URL (nor should they). For Linked Data to work with today's Web, it must keep requests for resources separate from requests for description documents.<br /><br />The community has voiced common concerns about the complexity of URI allocation and the use of 303s using today's software. The LOD community jumped in with a few alternatives; however, we must consider how the Web works today and be realistic about further Web client expectations. The established 303 technique works today using today's Web browsers. 
303 redirects may be complicated to set up in a document server, but let's give Linked Data servers a chance to mature.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-6740712396680620892010-09-13T10:21:00.000-07:002010-09-13T10:23:40.403-07:00HTML-Oriented DevelopmentThe heart of all Web applications is the user interface (UI) design - this is what its users interact with. As any consultant knows: clients are more satisfied with a well designed UI and mediocre business logic than they are with a poorly designed UI with minimal transparency and fully automated business rules.<br /><br />What is surprising (when you think about it) is that most Web application frameworks orient around the business model and treat the HTML like a second class citizen. The conceptual model may be important, but even more important is the representation of the model in HTML. Good UIs provide the user with full transparency to the state and operations of the underlying model. It doesn't matter how good the model is if the HTML is too confusing or too obscure; users will avoid using it.<br /><br />The HTML of Web applications is surprisingly rich with domain concepts. Most well designed UIs contain all the classes, relationships, and attributes found in the underlying model and present them to the user in a language everyone involved can understand. There are a lot of emerging standards that can help turn this human readable data in HTML into machine readable data using RDFa, microformats, or microdata.<br /><br />Recently, David Wood and I started the project Callimachus; it has taken a different approach to Web application design/development. Callimachus reads the domain model from your HTML templates! 
In Callimachus there is no need to maintain multiple models, no SQL schema, no query languages, no object-relational mapping; it's all embedded in HTML using RDFa.<br /><br />RDFa allows your HTML to include resource identifiers, their relationships, and properties using additional attributes such as: about, rel, and property. Consider the following HTML snippet. Using RDFa the data is readable by both humans and machines alike. It says that James Leigh knows David Wood using the relationship "foaf:knows" and the property "foaf:name".<br /><br /><div about="james"><br /> <span property="foaf:name">James Leigh</span><br /> knows<br /> <div rel="foaf:knows" resource="david"><br /> <span property="foaf:name">David Wood</span><br /> </div><br /></div><br /><br />Written using a Callimachus HTML template it might look like the snippet below. Here is an embedded query asking who knows "david" and what is their name.<br /><br /><div about="?who"><br /> <span property="foaf:name" /><br /> knows<br /> <div rel="foaf:knows" resource="david"><br /> <span property="foaf:name" /><br /> </div><br /></div><br /><br />Callimachus provides the framework necessary to create HTML templates to query, view, edit, and delete resources. 
This technique allows Web developers to save time and maintenance costs by applying the DRY principle (Don't Repeat Yourself) to Web application development.<br /><br />For more information about Callimachus see <a href="http://callimachusproject.org/">http://callimachusproject.org</a> or tune into my live Webcast on Wednesday at <a href="http://www.wilshireconferences.com/semtech2010/email/email-webcast-091510.html">http://www.wilshireconferences.com/semtech2010/email/email-webcast-091510.html</a>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-41958303496871984142010-05-27T06:36:00.000-07:002010-05-27T06:43:35.751-07:00The Future of RDFAt the end of June, immediately after SemTech, I'll be attending the W3C RDF Next Step Workshop. This workshop has been set up with the goal of gathering feedback from the Semantic Web community to determine if (and how) RDF should evolve in the future. I'll be presenting two papers with <a href="http://prototypo.blogspot.com/">David Wood</a> which I hope will generate good discussions...(To review the papers or for more information on the workshop, go to <a href="http://www.w3.org/2001/sw/wiki/RDF/NextStepWorkshop">NextStepWorkshop</a>.)<br /><br />The first paper I'm presenting will show a new <a href="http://www.w3.org/2009/12/rdf-ws/papers/ws13">RESTful RDF Store API</a> supporting named queries and change isolation. (I blogged about this <a href="http://jamesrdf.blogspot.com/2010/02/beyond-sparql-protocol.html">earlier this year</a>.) This proposed API would combine basic CRUD operations over RDF constructs (graphs, services and queries) and mandate RDF descriptions of services. With the ability to modify an RDF store's state in SPARQL 1.1 comes the challenge of managing store versions and the need to manage them (and their differences) over HTTP.<br /><br />The other paper is a proposed <a href="http://www.w3.org/2009/12/rdf-ws/papers/ws14">alternative handling of rdf:List in SPARQL</a>. 
The way we currently deal with ordered collections in RDF, whether through tools or in SPARQL, is so difficult that it limits adoption of RDF. So much of data retrieval, which is currently dominated on the Web by XML, includes the notion of ordered collections - RDF must align its representation with the conceptual notion of ordered collections if it is to have a chance of making inroads into already established networks. <br /><br />Where do you think RDF needs to go in the future? Does it need to change if it is going to stay viable?<br /><br />Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-78549393268343052802010-03-08T09:06:00.000-08:002010-03-08T09:07:56.881-08:00Reinventing RDF ListsLast month the <a href="http://lists.w3.org/Archives/Public/semantic-web/2010Jan/0085.html">SW interest group discussed</a> alternatives to containers and collections as part of a discussion around what the next generation of <a class="zem_slink" href="http://en.wikipedia.org/wiki/Resource_Description_Framework" title="Resource Description Framework" rel="wikipedia">RDF</a> might look like. Below is my opinion on the matter.<br /><br />RDF's simplistic approach makes it possible to encode most data structures, both simple and complex. The challenge people have with RDF, coming from other Web formats, is the lack of basic ordered collections (a concept common in XML). 
In RDF you are forced into a linked list structure just to preserve resource order. The linked list structure known as rdf:List is difficult to work with and highly inefficient within modern RDF stores.<br /><br />Most RDF formats provide syntactic sugar to make it easier to write rdf:Lists. In Turtle this is done using round brackets (parentheses); in RDF/XML this is done using the parseType collection attribute. However, because rdf:List is not a fundamental concept in RDF, no RDF store implementation preserves them, instead opting to use the fundamental triple form -- a linked list.<br /><br />RDF is made of the following fundamental concepts: URI, Literal, and Blank Node. A fundamental list concept should be added to make it easier and more efficient to work with ordered collections. This would not have a significant effect on RDF formats, as their syntax would not change, but would have a significant impact on the mindset of RDF implementers.<br /><br />With this change RDF implementers would strive to ensure that lists are implemented efficiently and provide convenient operations on them, just as they would other fundamental RDF concepts. The triple (linked list) form should be kept for compatibility with RDF systems that don't preserve lists, but the goal would be that RDF systems would not be obligated to provide a triple linked list form that has proven to be inefficient.<br /><br />By making lists a fundamental RDF concept, there is no required change for RDF libraries to continue to be compatible with existing standards. 
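For concreteness, the triple (linked list) form that stores fall back to can be sketched with a small generator of rdf:first/rdf:rest chains:

```javascript
// Sketch: encode an ordered collection as the rdf:List linked-list
// triples (rdf:first/rdf:rest/rdf:nil) that stores fall back to.
var RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

function listToTriples(items) {
  var triples = [];
  var node = "_:b0";
  for (var i = 0; i < items.length; i++) {
    var next = i + 1 < items.length ? "_:b" + (i + 1) : RDF + "nil";
    triples.push([node, RDF + "first", items[i]]); // the element itself
    triples.push([node, RDF + "rest", next]);      // link to the tail
    node = next;
  }
  return triples;
}

var triples = listToTriples(['"a"', '"b"', '"c"']);
console.log(triples.length); // 2 triples per element: 6
```

Reading the Nth element back means walking N rdf:rest links, which is exactly the inefficiency a fundamental list concept would remove.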
Most libraries and systems may already understand list shorthand and some may also preserve it.<br /><br />Anonymousnoreply@blogger.com3tag:blogger.com,1999:blog-3465716704098863819.post-75650829600709544732010-03-01T07:06:00.000-08:002010-03-01T07:08:40.728-08:00Improving RDF/XML InteroperabilityAll the permitted variations in RDF/XML make working with it using XML tools difficult, at best. Most of the time assumptions are made about the structure of the RDF/XML document. These are based on particular RDF/XML implementations. However, there is no standard spec that says what this simplified structure should be. The next generation of RDF specs should correct this and create a subset of RDF/XML for working with XML tools.<br /><br />A good place to start is with the <a href="http://lists.w3.org/Archives/Public/semantic-web/2010Jan/0242.html">document Leo Sauermann has started</a>. I like the design, but feel the rules could be improved, based on my experience.<br /><br />The design of SimpleRdfXml, as proposed by Leo, is:<br /> 1. be compatible with RDF/XML<br /> 2. but only a subset<br /> 3. restrict to simplicity <br /><br />The rules I try to follow when serializing RDF/XML for use with XSLT are:<br /><br /> 1. No nested elements (references to resources must be done via rdf:resource or rdf:nodeID).<br /> 2. No property attributes.<br /> 3. 
All blank nodes identified by rdf:nodeID.<br /> 4. Only full URIs, no relative URIs.<br /> 5. No typed node elements; always use an rdf:type element.<br /> 6. The same rdf:about may be repeated on multiple nodes.<br /> 7. Never use rdf:ID.<br /> 8. Always use rdf:li when possible.<br /> 9. Always use rdf:parseType="Collection" when possible.<br />10. All rdf:XMLLiterals written as rdf:parseType="Literal".<br />11. Never use rdf:parseType="Resource".<br />12. White-space is preserved within literal tags.<br /><br />By standardizing on these (or another RDF/XML subset), interoperability between XML and RDF tools becomes possible. This allows existing shops to reuse their current XML skills to work with RDF, easing their transition.Anonymousnoreply@blogger.com4tag:blogger.com,1999:blog-3465716704098863819.post-35419412569115007312010-02-17T04:34:00.000-08:002010-02-17T04:36:21.284-08:00Beyond the SPARQL ProtocolThe SPARQL Protocol has done a lot to bring different RDF stores together and make interoperability possible. However, the SPARQL Protocol does not encompass all operations that are typical of an RDF store. Below are some ideas that would extend the protocol enough for it to become a general protocol for RDF store interoperability.<br /><br />One common complaint is the lack of direct support for graphs. 
This is partly addressed in the upcoming SPARQL 1.1, which includes support for GET/PUT/POST/DELETE on <a href="http://www.w3.org/2009/sparql/docs/http-rdf-update/">named graphs</a>. However, it is still missing the ability to manage these graphs. What is still needed is a way to assign a graph name to a set of triples, as well as a vocabulary to search and describe the available graphs. To support graph construction, the service could accept a POST request of triples and respond with the name of the created graph. The graph metadata could be available in a separate service or as part of the existing SPARQL service, made available via SPARQL queries.<br /><br />The use of POST in SPARQL ensures <a class="zem_slink" href="http://en.wikipedia.org/wiki/Serializability" title="Serializability" rel="wikipedia">serializability</a> of client operations. However, it prevents HTTP caching (even with reasonably sized queries), which is necessary for Web scalability. This can be rectified by introducing standard named query support. By providing the client with the ability to create and manage server-side queries (with variable bindings), many common operations can become cacheable. These named queries can be described in their own service or as part of the existing SPARQL service. The named query metadata would include optional variable bindings and <a class="zem_slink" href="http://en.wikipedia.org/wiki/Web_cache" title="Web cache" rel="wikipedia">cache control</a> settings. The queries could then be evaluated on HTTP GET to the URI of the query name, using the configured cache control, enabling Web scalability.<br /><br />Another requirement for broad RDF store deployments is the ability to isolate changes. Many changes are directly dependent on a particular state of the store and cannot be represented in an update statement. 
Although SPARQL 1.1 allows update statements to be dependent on a graph pattern, many changes have indirect relationships to the store state and cannot be related directly within a WHERE clause.<br /><br />To accommodate this form of isolation, separate service endpoints are needed to track the observed store state and the triples inserted/deleted. Metadata about the various available endpoints could be discoverable within each service (or through a dedicated service). This metadata could include such information as the parent service (if applicable) and the <a class="zem_slink" href="http://en.wikipedia.org/wiki/Isolation_%28database_systems%29" title="Isolation (database systems)" rel="wikipedia">isolation level</a> used within the endpoint.<br /><br />To support serializable isolation, each endpoint would need to watch for Content-Location: &lt;endpoint-uri&gt; headers, which would indicate the source of the update statement in the POST requests. When such an update occurs, the service must validate that the observed store state in the source endpoint is the same as the store state in the target endpoint before proceeding.<br /><br />By standardizing graph, query, and isolation vocabularies within the SPARQL protocol, RDF stores would be much more appealing to a broader market.<fieldset class="zemanta-related"><legend class="zemanta-related-title">Related articles by Zemanta</legend><ul class="zemanta-article-ul"><li class="zemanta-article-ul-li"><a href="http://www.w3.org/blog/SW/2010/01/28/new_sparql_drafts_published">New SPARQL drafts published</a> (w3.org)</li></ul></fieldset>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-38321019142408709852010-02-09T07:45:00.000-08:002010-02-09T07:51:08.051-08:00RDFa Change SetsWith so many sophisticated applications on the Web, the key/value HTML form seems overly simplistic for today's Web applications. The browser is increasingly being used to manipulate complex resources, and a popular technique for encoding sophisticated data in HTML is <a class="zem_slink" href="http://en.wikipedia.org/wiki/RDFa" title="RDFa" rel="wikipedia">RDFa</a>.<br /><br />RDFa defines a method of encoding data within the DOM of an HTML page using attributes. This allows complex data resources to be connected to the visual aspects that are used to represent them. RDFa provides a standard way to convert an HTML DOM structure into RDF data for further processing.<br /><br />Instead of encoding your data in a key/value form, encode your data in RDFa and use DHTML and AJAX to manipulate the DOM structure and, in turn, the data. The conversion from HTML to data can be done on the server or client using existing libraries.<br /><br />There are a few ways that RDFa can help with communication to the server. The simplest would be to send back the entire HTML DOM for RDFa parsing on the server. However, an HTML page may carry a great deal of presentational bulk, so this would not be appropriate as a general solution. Instead, using an RDFa parser on the client, the resulting RDF data can be sent to the server, ensuring only the data is transmitted back. 
This would reduce excessive network traffic and move some of the processing to the client.<br /><br />In a recent project, we went further and used rdfquery to parse before-and-after snapshots on the client to prepare a change-set for submission back to the server. In JavaScript, the client prepared an RDF graph of removed relationships and properties and an RDF graph of added relationships and properties. These two graphs represent a change-set. By using change-sets throughout the stack, enforcing authorization rules and tracking provenance became much more straightforward. Change-sets also gave more control over the transaction isolation level by enabling the possibility of merging (non-conflicting) change-sets. Creating change-sets at the source (on the client) eliminated the need to load and compare all properties on the server, making the process more efficient and less fragile.<br /><br />Using RDFa on the client and submitting change-sets can help streamline data processing and manipulation and avoid much of the boilerplate code associated with mapping data from one format to another.Anonymousnoreply@blogger.com11tag:blogger.com,1999:blog-3465716704098863819.post-5797090467483498572009-11-02T09:00:00.000-08:002009-11-02T11:10:11.773-08:00Why isn't the Web Object-Oriented?A big part of the Web is web services, but often these services are not modelled using an object oriented
paradigm, even though it is well suited for complex behaviours. Web services are often modelled using a simple request/response paradigm or a service oriented paradigm using a <a class="zem_slink" href="http://en.wikipedia.org/wiki/Representational_State_Transfer" title="Representational State Transfer" rel="wikipedia">RESTful</a> framework, but many of these resource oriented frameworks can be adapted to support some object oriented concepts.<br /><br />Many people think of classes and methods when they think of <a class="zem_slink" href="http://en.wikipedia.org/wiki/Object-oriented_programming" title="Object-oriented programming" rel="wikipedia">Object-Oriented Programming</a> (OOP). However, I like to think of OOP as message passing with class specialization. This is particularly helpful when designing Web services, which also use a message passing model. Even RESTful Web services use forms of message passing between nodes.<br /><br />Consider the simple URL below. When followed, a GET request is sent to a Google server. This can be thought of as sending Google's search object a message with the given search term parameter (using the Google network as the authority). The search object (in this case a proxy) responds with an HTML page containing the search results.<br /><br /><pre style="font-size:xx-small">      Object Authority<br />        _____|______<br />       /            \<br />http://www.google.com/search?q=Why+isn%27t+the+Web+Object-Oriented%3F<br />\__________________________/\_______________________________________/<br />             |                                  |<br />      Object Identity                        message<br /></pre><br /><br />All HTTP requests can be thought of as messages being sent to remote objects. The request method, query parameters, headers, and body make up the message, and the request URI identifies the message's target object. The HTTP response is the message's return value.<br /><br />However, OOP is more than simply message passing. A big part of OOP is the association of behaviour with data. 
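That association can be sketched in a few lines of Java. The Order and Account types below are invented purely for illustration; the point is that a service-style call must be told everything about its data, while an object-style call dispatches a message to a specialization that already carries its own:

```java
// Contrasts a stateless service call with object-oriented dispatch,
// where data and specialized behaviour travel together.
public class MessagePassing {

    // Service-oriented: a stateless singleton; every variation is handled
    // by the one service, which must be told everything about the data.
    static String describeService(String kind, String id) {
        if (kind.equals("order")) return "order " + id;
        if (kind.equals("account")) return "account " + id;
        throw new IllegalArgumentException(kind);
    }

    // Object-oriented: the describe() "message" is dispatched to a
    // specialization that already holds its own data.
    interface Resource { String describe(); }

    record Order(String id) implements Resource {
        public String describe() { return "order " + id; }
    }

    record Account(String id) implements Resource {
        public String describe() { return "account " + id; }
    }

    public static void main(String[] args) {
        Resource r = new Order("42");
        System.out.println(describeService("order", "42")); // service style
        System.out.println(r.describe());                   // object style
    }
}
```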
The relationship between behaviour and data drives at the difference between service oriented and object oriented paradigms. A service oriented model is like an object oriented model, but all objects are stateless singletons with their own unique behaviour. Because of this, pure service oriented systems can be more efficient (less data access), but are more expensive to maintain, as each service must consider all possible variations at once. In contrast, OOP supports behaviour specialization and can more closely reflect the structure of systems 'in the real world'.<br /><br />While many services are identified by a single request URI (scheme+authority+path), most RESTful frameworks allow data to also be associated with the URI. <a class="zem_slink" href="http://en.wikipedia.org/wiki/JAX-RS" title="JAX-RS" rel="wikipedia">JAX-RS</a>, for example, allows path parameters that are often populated with a unique entity ID. By incorporating the entity ID in the URI, data is associated with the behaviour in the same way as in an OOP paradigm. However, most RESTful frameworks fail to provide any support for object or resource behaviour specialization -- a feature that is incredibly powerful in class-based OOP.<br /><br />The Web is actually fairly close to seamlessly supporting an object-oriented paradigm. Processing efficiency seems to be the only barrier. However, with the growing costs of maintaining complex Web systems, I'm not sure how long this argument can hold up. 
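As a rough illustration of the entity-ID-in-URI idea (no framework involved; the /orders path and Order type are hypothetical), a request path can be split into the behaviour to invoke and the data it is bound to:

```java
// Sketch of associating an entity ID embedded in a request URI with an
// object, in the spirit of JAX-RS path parameters (no framework used;
// the /orders path and Order type are invented for illustration).
public class UriDispatch {

    record Order(String id) {
        String describe() { return "order " + id; }
    }

    // "GET /orders/42" is treated as a message to the object identified
    // by the whole path, not to a stateless /orders service.
    static String dispatch(String path) {
        String[] parts = path.split("/");   // "", "orders", "42"
        if (parts.length == 3 && parts[1].equals("orders")) {
            return new Order(parts[2]).describe();
        }
        return "404";
    }

    public static void main(String[] args) {
        System.out.println(dispatch("/orders/42"));
    }
}
```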
When do you think we'll have an object oriented Web framework, and what would it look like?Anonymousnoreply@blogger.com1tag:blogger.com,1999:blog-3465716704098863819.post-66207982807004292272009-10-26T07:23:00.000-07:002009-10-27T09:06:22.911-07:00The Complicated Software StackTo aspiring Web application developers or people looking to put together their own Web application: the road to building a modern working Web application is a long and complicated journey.<br /><br />Today's Web application developer is nothing short of a jack-of-all-trades, requiring deep knowledge of everything from HTML and CSS to Java and SQL. Everything from common <a class="zem_slink" href="http://en.wikipedia.org/wiki/Create%2C_read%2C_update_and_delete" title="Create, read, update and delete" rel="wikipedia">CRUD</a> tasks to sophisticated work-flows requires knowledge of half a dozen computer languages along with their quirks and variations across platforms and applications.<br /><br />Today's software is built using a mix of <a class="zem_slink" href="http://en.wikipedia.org/wiki/Programming_paradigm" title="Programming paradigm" rel="wikipedia">programming paradigms</a> and data models. Every level in the software stack requires explicit data mapping between paradigms. 
Many Web applications include the following levels in their software stack:<br />• Relational for persistence,<br />• Object oriented (class-based) in the model,<br />• Aspects peppered throughout,<br />• Resource (or activity) oriented Web services,<br />• Functional template engines,<br />• Markup using key/value pairs, and<br />• Prototype based objects for UI behaviour.<br /><br />The above complication comes at a price. Software takes longer to develop and is more expensive to maintain than it used to be. This is causing a greater divide between small tools and large software systems.<br /><br />Applications like Microsoft Excel, which combine data processing and persistence in a consistent programming paradigm, have grown in popularity as cheap alternatives to the complexity of modern Web applications.<br /><br />While the market for Web applications has grown, the scope has decreased, favouring large high-volume systems. Smaller Web applications are too often over-architected and over-budget. 
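The mapping cost between those levels can be made concrete with a toy Java sketch (all names invented): the same order record is hand-translated from a relational-style row into a JSON string for the browser, and every boundary in a real stack needs code like this:

```java
import java.util.Map;

// Toy illustration of the mapping tax: the same datum hand-translated
// across layers of a typical stack (the Order type and field names
// are invented for illustration).
public class MappingTax {

    record Order(String id, int total) {}

    // Relational layer: a row arrives as column/value pairs.
    static Order fromRow(Map<String, String> row) {
        return new Order(row.get("id"), Integer.parseInt(row.get("total")));
    }

    // Web layer: hand-rolled JSON for the UI's prototype-based objects.
    static String toJson(Order o) {
        return "{\"id\":\"" + o.id() + "\",\"total\":" + o.total() + "}";
    }

    public static void main(String[] args) {
        Order o = fromRow(Map.of("id", "42", "total", "99"));
        System.out.println(toJson(o)); // each layer needs its own mapping code
    }
}
```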
There is a large (and growing) opportunity for software vendors to fill this divide and create a new platform that combines data processing and persistence, using a single programming paradigm, for Web applications.<br /><br />Can Web applications be built to use a single programming paradigm?Anonymousnoreply@blogger.com5tag:blogger.com,1999:blog-3465716704098863819.post-21553184440337069122009-09-29T17:18:00.000-07:002009-10-01T09:12:04.136-07:00Chrome Frame: Love It Or Hate It<p class="zemanta-img" style="margin: 1em; float: right; display: block; width: 104px;"><a href="http://en.wikipedia.org/wiki/Image:GoogleChromeLogo.png"><img src="http://upload.wikimedia.org/wikipedia/en/3/35/GoogleChromeLogo.png" alt="Google Chrome" style="border: medium none ; display: block;" height="93" width="98"></a><span class="zemanta-img-attribution">Image via <a href="http://en.wikipedia.org/wiki/Image:GoogleChromeLogo.png">Wikipedia</a></span></p>Google has clearly struck a nerve among browser makers with the announcement of Chrome Frame. 
Microsoft was awfully quick to downplay any <a href="http://arstechnica.com/microsoft/news/2009/09/microsoft-google-chrome-frame-makes-ie-less-secure.ars">thoughts about installing Chrome</a> as a plugin for IE, considering it refers to <a class="zem_slink" href="http://webkit.org/" title="WebKit" rel="homepage">WebKit</a>'s market share as a "<a href="http://www.techcrunch.com/2009/09/29/ballmer-microsoft-interview-chrome-windows-internetexplorer/">rounding error</a>". Mozilla has also recently become vocal about putting down any <a href="http://www.computerworld.com/s/article/9138662/Mozilla_slams_Google_s_Chrome_Frame_as_browser_soup_">notion of a browser-in-a-browser</a> solution. This is all quite bizarre, as both of these players are big into browser plugins of some form or another: Microsoft with its alternative <a class="zem_slink" href="http://www.microsoft.com/SILVERLIGHT" title="Microsoft Silverlight" rel="homepage">Silverlight</a> application engine, and Mozilla, which acquired its market share through <a href="http://www.computerworld.com/s/article/9011975/20_must_have_Firefox_extensions">extensible plugins</a> of its own.<br /><br />It is actually quite common to have multiple rendering engines within the same browser: Flash, Silverlight, and Java being the most obvious, but there are more. IE has had a number of browser plugins in the past, including <a href="http://www.iol.ie/%7Elocka/mozilla/control.htm">Mozilla ActiveX Control</a> and <a href="http://code.google.com/p/svgweb/">Google's SVG plugin</a>. IE8 ships with multiple rendering engines that get triggered based on HTML tags or user actions. <a href="http://en.wikipedia.org/wiki/Netscape_7">Netscape 7</a>, although short lived, shipped with both the <a class="zem_slink" href="http://developer.mozilla.org/en/docs/Gecko" title="Gecko (layout engine)" rel="homepage">Gecko</a> and IE rendering engines. 
Mozilla has encouraged this type of action in the past, with Google's <a href="http://code.google.com/p/explorercanvas/">ExCanvas</a> and Mozilla's, now inactive, <a href="https://wiki.mozilla.org/Tamarin:ScreamingMonkey">Screaming Monkey</a> initiative. Today Mozilla still makes <a href="https://addons.mozilla.org/en-US/firefox/addon/1419">IE available as a Firefox plugin</a>.<br /><br />I think it is ridiculous to ask users to use only particular browsers for particular websites. The choice of rendering engine should belong to the website author, and I would welcome a mega-browser that seamlessly switches between Gecko, Trident, WebKit, and Presto based on the preferred engine of the author. More precisely, I trust website authors to choose standards-compliant engines more than I trust users to choose standards-compliant browsers.<br /><br />I find Mozilla's reaction particularly interesting, as it comes at a time when I find myself, an old Gecko fan, looking at WebKit more seriously. Recently in a project, due to an <a href="https://developer.mozilla.org/en/XPath/Axes/namespace">old outstanding Gecko issue</a>, I had to put Firefox support on hold while Trident, Presto and WebKit continued to operate without much trouble.<br /><br />I know it is true with IE, but perhaps it is true with Mozilla as well, that they view the engine as just something a browser needs and not a feature in and of itself. 
Perhaps I have been wrong all along and <a class="zem_slink" href="http://en.wikipedia.org/wiki/XUL" title="XUL" rel="wikipedia">XUL</a> is actually Mozilla's doom.<fieldset class="zemanta-related"><legend class="zemanta-related-title">Related articles by Zemanta</legend><ul class="zemanta-article-ul"><li class="zemanta-article-ul-li"><a href="http://www.downloadsquad.com/2009/09/26/webkit-claims-another-browser-as-epiphany-bails-on-mozilla/">WebKit claims another browser as Epiphany bails on Mozilla</a> (downloadsquad.com)</li><li class="zemanta-article-ul-li"><a href="http://tech.slashdot.org/story/09/09/29/2052248/Mozilla-Slams-Chrome-Frame-As-Browser-Soup?from=rss">Mozilla Slams Chrome Frame As "Browser Soup"</a> (tech.slashdot.org)</li><li class="zemanta-article-ul-li"><a href="http://www.computer-realm.net/chrome-vs-firefox/">Chrome vs Firefox</a> (computer-realm.net)</li></ul></fieldset>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-82892542610929402362009-09-20T09:30:00.000-07:002009-09-20T11:13:24.845-07:00Accept Headers: In The WildAs web agents (including browsers) become more diverse, there is an increasing need to distinguish between their types. 
The <a class="zem_slink" href="http://en.wikipedia.org/wiki/User_agent" title="User agent" rel="wikipedia">User-Agent</a> header can be used for this task, but requires the server to know in advance all the possible agents and what type they are. This is not possible, as both the diversity and quantity of agents are growing too quickly for any single registry to track.<br /><br />According to the HTTP specification, the Accept header can be used to determine the type of agent. For example:<br />• HTML browsers should include "text/html" within the Accept header,<br />• XHTML browsers include "application/xhtml+xml",<br />• RDF browsers include "application/rdf+xml",<br />• XSLT agents include "application/xml",<br />• PDF agents include "application/pdf",<br />• Office suites include "application/x-ms-application" or "application/vnd.oasis.opendocument", and<br />• JavaScript libraries include "application/json"<br /><br />This allows the server to better redirect the agent to an appropriate resource.<br /><br />Obviously, if a service will only serve HTML browsers, the type of agent is not necessary, as was the case in the Web 1.0 days when everything on the Web was HTML. However, as HTTP becomes a more popular protocol for non-HTML communication, the need to distinguish between types of agents is becoming important.<br /><br />Consider the situation when an abstract information resource (like an order or an account) is identified by a URL. When the server receives a request for an abstract information resource, it needs to know which type of agent is requesting it, so it can better redirect the agent to an appropriate representation. 
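Choosing a representation from an Accept header can be sketched as below (plain Java; a simplified treatment that ignores media-type specificity tie-breaking and malformed q-values):

```java
import java.util.List;

// Minimal content negotiation: pick the server-side representation the
// agent prefers, using Accept header q-values (simplified: no
// specificity tie-breaking; whitespace-tolerant).
public class AcceptNegotiation {

    // Quality the Accept header assigns to one concrete media type.
    static double qualityFor(String mediaType, String accept) {
        double best = 0.0;
        for (String clause : accept.split(",")) {
            String[] parts = clause.trim().split(";");
            String range = parts[0].trim();
            double q = 1.0; // default quality per the HTTP spec
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim();
                if (p.startsWith("q=")) q = Double.parseDouble(p.substring(2));
            }
            boolean matches = range.equals("*/*")
                    || range.equals(mediaType)
                    || (range.endsWith("/*")
                        && mediaType.startsWith(range.substring(0, range.length() - 1)));
            if (matches && q > best) best = q;
        }
        return best;
    }

    // Returns the offered type with the highest quality, or null if none match.
    static String choose(List<String> offered, String accept) {
        String chosen = null;
        double best = 0.0;
        for (String type : offered) {
            double q = qualityFor(type, accept);
            if (q > best) { best = q; chosen = type; }
        }
        return chosen;
    }

    public static void main(String[] args) {
        String ff = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        // Firefox's header prefers the HTML page over a PDF report.
        System.out.println(choose(List.of("application/pdf", "text/html"), ff));
    }
}
```

Against the Firefox header above, an exact "text/html" match (q=1.0) beats "application/pdf", which only matches the */*;q=0.8 clause.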
If the agent is an HTML browser, the server should redirect to an HTML page displaying the order or account information; if a JavaScript library, the server should redirect to a JSON dump of the order/account summary; if a PDF agent, the server should redirect to an order/account summary report; if an office suite, the server should redirect to a spreadsheet of the details.<br /><br />This works very well in theory, but because the Web was built with only HTML browsers in mind, most browsers don't properly implement the HTTP specification (because they don't have to). Even worse, most non-HTML agents either don't include an Accept header at all or use */* and say nothing about the type of agent. Below are some of the default Accept headers from popular user agents on the web.<br /><br />FF3.5 is an HTML and XHTML browser first, XML/XSLT agent second<br />text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8<br /><br />IE8 is a media viewer (apparently)<br />image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, */*<br /><br />IE8+office is a media viewer and office suite<br />image/gif, image/jpeg, image/pjpeg, application/x-ms-application,<br /> application/vnd.ms-xpsdocument, application/xaml+xml,<br /> application/x-ms-xbap, application/x-shockwave-flash,<br /> application/x-silverlight-2-b2, application/x-silverlight,<br /> application/vnd.ms-excel, application/vnd.ms-powerpoint,<br /> application/msword, */*<br /><br />Chrome3 is an XHTML and XML/XSLT agent first, HTML browser second, and text viewer third.<br />application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5<br /><br />Safari3 is an XHTML and XML/XSLT agent first, HTML browser second, and text viewer third.<br />text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5<br /><br />Opera10 is an HTML and XHTML browser first, XML/XSLT agent second.<br />text/html,
application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1<br /><br />The MSN bot is an HTML browser, text viewer, xml client and application archiver.<br />text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf<br /><br />Google search bot is a jack of all agents, master of none<br />*/*<br /><br />Yahoo search bot is a jack of all agents, master of none<br />*/*<br /><br />AppleSyndication is a jack of all agents, master of none<br />*/*<br /><br />See Also:<br /><a href="http://www.newmediacampaigns.com/page/browser-rest-http-accept-headers">Unacceptable Browser HTTP Accept Headers (Yes, You Safari and Internet Explorer)</a><br /><a href="http://www.newmediacampaigns.com/page/webkit-team-admits-accept-header-error">WebKit Team Admits Error, Downplays Importance, Re: 'Unacceptable Browser HTTP Accept Headers'</a>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-21547360547276572012009-08-24T06:58:00.000-07:002009-08-24T07:05:12.069-07:00Dereferencable IdentifiersA document URL is a dereferencable document identifier. We use <a class="zem_slink" href="http://en.wikipedia.org/wiki/Uniform_Resource_Locator" title="Uniform Resource Locator" rel="wikipedia">URLs</a> all over the Web to identify HTML pages and other web resources. 
When you can't give out a brochure, you can share a URL. Instead of sending a large email attachment, you might just send a URL. Rather than creating long appendices, you can simply link to other resources. It is so much more useful to pass around URLs than to transfer entire documents around.<br /><br />This model has worked well for documents and is now being adopted for other types of resources. With the popularity of XML, using URLs to identify data resources is now commonplace. Rather than passing around a complete record, agents pass around an identifier that can be used to look up the record later. By using a URL as the identifier, these agents don't need to be tied to any single dataset and are much more reusable.<br /><br />From the <a class="zem_slink" href="http://en.wikipedia.org/wiki/HTML_5" title="HTML 5" rel="wikipedia">HTML5</a> standardization process has risen the debate on the usefulness of URLs as model identifiers. Most people agree that a URL is a good way to identify documents, web resources and data resources. However, the debate continues on the usefulness of using a URL as an identifier within a model vocabulary. One side claims that a model vocabulary should be centralized and therefore does not require the flexibility of a URL. The other side claims the model vocabulary should be extensible and requires the universal identifying scheme that URLs provide.<br /><br />To understand the potential usefulness of using a URL as a model identifier, consider the behaviour difference between a missing <a class="zem_slink" href="http://en.wikipedia.org/wiki/Document_Type_Definition" title="Document Type Definition" rel="wikipedia">DTD</a> and a missing Java class. A DTD is identified using a URL and a Java class is not. When an XML validator encounters a DTD it does not understand, it dereferences the identifier and uses the resulting model to process the XML document. 
When a JVM encounters a Java class it does not understand, it throws an exception, often terminating the entire process. Now consider how much easier it would be to program if a programming environment used URLs for classes and model versions. Dependency management would become as simple as managing import statements. As the Web becomes the preferred programming environment of the future, we must consider these basic programming concerns.<br /><br />Although I enjoy working in abstractions, I certainly understand how things always get more complicated when you go meta: using URLs to describe other URLs. However, this complexity is essential to maintaining the flexibility and extensibility of the Web.<br /><br />See Also: <a href="http://www.jenitennison.com/blog/node/124">HTML5/RDFa Arguments</a>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-23518021733860776872009-08-23T11:47:00.000-07:002009-08-23T11:57:03.129-07:0097 Things Every Project Manager Should KnowIf the projects you manage don't go as smoothly as you'd like, 97 Things Every Project Manager Should Know offers priceless knowledge gained through years of trial and error. 
This illuminating book contains 97 short and extremely practical tips -- whether you're dealing with software or non-IT projects -- from some of the world's most experienced project managers and software developers. You'll learn how they've dealt with everything from managing teams to handling project stakeholders to runaway meetings and more.<br /><br />This is O'Reilly's second book in its 97 Things series. My contributions included the tips Provide Regular Time to Focus and Work in Cycles.<br /><br /><fieldset class="zemanta-related"><legend class="zemanta-related-title">Related articles by Zemanta</legend><ul class="zemanta-article-ul"><li class="zemanta-article-ul-li"><a href="http://oreilly.com/catalog/9780596804169/">97 Things Every Project Manager Should Know</a> (oreilly.com)</li></ul></fieldset><br /><br />Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-78187234667868532652009-07-31T08:47:00.000-07:002009-08-01T19:09:07.216-07:00SPARQL Federation and QuintsThere are currently a couple of popular ways to federate SPARQL endpoints together:<br /><br />1) In Jena the service must be explicitly part of the query, and therefore the model,<br /><br />2) In Sesame the basic query patterns must be associated with one or more endpoints before evaluating the query, or<br /><br />3) Hack the remote query into a graph URI: <a href="http://gearon.blogspot.com/2009/05/federated-queries-long-time-ago-tks.html">http://gearon.blogspot.com/2009/05/federated-queries-long-time-ago-tks.html</a><br /><br />Although all of these can be used to achieve the same results, Jena's solution puts more responsibility in the data model, while Sesame's puts more responsibility in the deployment. Both have their trade-offs, but I believe the query is supposed to be abstracted away from underlying services. The domain model (and therefore the queries) should not be aware of how the data is distributed (or stored) across a network. Therefore, I prefer to describe which graph patterns and relationships are available at each endpoint during deployment and make the application model independent of available service endpoints.<br /><br />Furthermore, I think it is a bit silly to add yet another level of complexity to the basic query pattern. Adding the service level turns the basic query pattern from a quad to a quint.<br /><br />To fully index a quint (with support for a service variable, which Jena does not support) would take 13 indexes (nearly double what a quad requires). Below is a table of some complexity levels and how many indexes they require to be fully indexed (variables could appear in any position within the pattern). 
I have included a theoretical sext that would allow you to group services in a network (just as graphs can be grouped in a service).<table><thead><tr><th>Level</th><th># of Indexes</th><th>Term</th><th>Data Structure</th></tr></thead><tbody><tr><td>double</td><td>2</td><td>subject</td><td>directed graph</td></tr><tr><td>triple</td><td>3</td><td>predicate</td><td>labelled directed graph</td></tr><tr><td>quad</td><td>7</td><td>graph</td><td>multiple labelled directed graphs</td></tr><tr><td>quint</td><td>13</td><td>service</td><td>replicated multiple labelled directed graphs</td></tr><tr><td>sext</td><td>25</td><td>network</td><td>trusted replicated multiple labelled directed graphs</td></tr></tbody></table>Switching from triples to quads provides a big functionality leap (the ability to refer to an entire graph as a single resource). However, I question how much functionality a quint (or a sext) adds over a quad. Couldn't the same functionality be put into a property of the graph (or embedded in the graph's URI authority)? An inferencing engine/query could also infer graph relationships like subGraphOf, which would still allow a large, but precise, collection of graphs to be queried effectively.<br /><br />Hopefully, this topic will have more time to mature before the SPARQL working group makes any official decisions on the matter.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-13259836185719632532009-07-22T05:34:00.000-07:002009-07-22T06:27:00.099-07:00Enterprise Information Systems and Web TechnologiesI recently got back from speaking at the Enterprise Information Systems and Web Technologies conference in Orlando, Florida. There I presented my paper on an <a href="http://www.leighnet.ca/publications.xhtml">Object-Oriented rules engine</a>. In the talk I shared examples of when businesses need to coordinate, track data, and check policies between organizations. 
Examples include transportation, satellite data tracking, and contract management. I outlined the following requirements and went into detail on the various components of the system.<br /><br /><span style="font-weight:bold;">Requirements</span><br />Reduce the investment costs and time<br />Policies understandable by domain experts<br />Rules must not inadvertently interfere with one another<br />Model complex domains<br />Easily adapted to change<br />Policy rules have access to external services<br />Track all state changes, both their cause and effect<br /><br />The talk was well received and prompted some interesting discussions.Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-85421055477325778952009-07-08T13:45:00.000-07:002009-07-08T13:50:06.913-07:00Panel: Linked Open DataThe SemTech 2009 Videos have been posted, including the Linked Open Data Panel.<br /><br />The "data commons" is a cornerstone of the semantic web vision. The Linked and Open Data movements are progressing beyond the early adopter phase and preparing to cross the chasm. Enough experience now exists to reflect on how this data set is being used, how useful it is, and where we can take it from here. Beyond the basics, the panel will discuss issues such as quality of service, stability, and longevity. 
They'll also explore the evolution of the semantic web with a particular emphasis on modes of data use, reuse and aggregation.<br /><br />Paul Miller, The Cloud of Data<br />Jamie Taylor, Metaweb Technologies, Inc.<br />Leigh Dodds, Talis<br />James Leigh, James Leigh Services, Inc.<br />Kingsley Idehen, OpenLink Software, Inc.<br /><br /><a href="http://www.semanticuniverse.com/semtech-panel-linked-open-data.html">http://www.semanticuniverse.com/semtech-panel-linked-open-data.html</a><br />Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-4602145247661546332009-06-30T09:19:00.001-07:002009-06-30T12:16:30.259-07:00HTTP Servlet Caching FilterThe nice thing about the HTTP protocol is how easy it is to implement a trivial HTTP server. At some point, however, just responding to HTTP requests is not enough and response caching must be introduced.<br /><br />If you search for "servlet response caching", you will find advice to ensure you use the correct response headers to facilitate HTTP caching, and suggestions to use a servlet filter to cache the response. If you are like me, you would continue to search for a way to use both - a servlet filter that caches based on the correct response headers.<br /><br />With the HTTP protocol so well supported and J2EE so popular, finding a caching servlet filter that adheres to the HTTP spec should be easy, but it isn't. 
In fact, it is really hard to find any Java implementation that caches based on the HTTP response headers (servlet filter or otherwise). This seemed like an interesting problem that is fairly common, so I spent some time to see how far I could get with a servlet filter that understands HTTP caching.<br /><br />Despite my general knowledge of the HTTP spec, implementing it proved a lot more difficult. For example, the If-Modified-Since and If-None-Match headers are fairly easy to understand, but when you try to implement this logic, things get a little more complicated. In working through this, I realized that there are nearly 20 possible scenarios that need to be handled by the server. For the If-Modified-Since header alone, the server must handle three states: the request might not have the header, the resource might have been modified, or it might not have been modified. The If-None-Match header may be absent, may or may not match, or may carry a '*' tag, which in turn may or may not correspond to an existing entity. You can't just process these one at a time either, but once you write down the edge cases it can all be handled fairly compactly within a precondition check.<br /><br />Another area that surprised me was the request Cache-Control directives. There are five boolean directives and three that take a value. All of these directives are fairly easy to understand and are used to determine if the cache can be used and if it needs to be validated. However, that is a lot of variables to manage, and combined with the possible server directives, it gets really hairy tracking their state. There were many occasions when, while adding support for a new header/directive, I inadvertently broke an earlier unit test (I couldn't have done it without them).<br /><br />The HTTP spec is fairly clear in some areas, but less so in others. One area that has had various interpretations is entity tags. 
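To make the precondition check concrete, here is a minimal sketch of how the If-None-Match and If-Modified-Since states discussed above might collapse into a single test. The helper below is illustrative only, not the filter's actual code: a real servlet filter would read these values from the HttpServletRequest headers, and the interplay between the two headers is itself one of the areas the spec leaves open to interpretation, so this shows just one reasonable reading.

```java
import java.util.Arrays;

// Sketch of a conditional-GET precondition check (hypothetical helper;
// a real filter would pull these values from HttpServletRequest headers).
class Preconditions {

    // Returns true when a 304 Not Modified may be sent for a GET.
    // ifModifiedSince is -1 when that header is absent.
    static boolean notModified(String ifNoneMatch, long ifModifiedSince,
                               String currentETag, long lastModified) {
        if (ifNoneMatch != null) {
            // "*" matches any existing entity; otherwise compare each listed tag
            boolean match = ifNoneMatch.equals("*")
                    || Arrays.asList(ifNoneMatch.split("\\s*,\\s*"))
                             .contains(currentETag);
            if (!match)
                return false; // tags differ: send the full response
            // a matching tag may still be overridden by a failed date check
            return ifModifiedSince < 0 || lastModified <= ifModifiedSince;
        }
        if (ifModifiedSince >= 0)
            return lastModified <= ifModifiedSince;
        return false; // no preconditions present: send the full response
    }
}
```

Writing the states out this way is what makes the "nearly 20 scenarios" manageable: each header contributes its outcome independently, and the combinations fall out of one short method.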
It is fairly clear how ETags should be used with static GET requests, although I had to digest their implications on caches before I could understand how to use them with <a class="zem_slink" href="http://en.wikipedia.org/wiki/Content_negotiation" title="Content negotiation" rel="wikipedia">content negotiation</a>. However, their recommended use with PUT and DELETE is still a bit of a mystery. When an entity has no fixed serialized format (such as a data record), it has many entity tags (one for each serialized variation and version). So, which entity tag should be used after a PUT or a DELETE that affects all variations?<br /><br />This gets even more complicated when some URLs are used to represent a property of a data record. If the property is a foreign key, the response has no serializable format; it's a 303 See Other response. What does a PUT look like when you want to reference another resource? Furthermore, a DELETE of a property just deletes the property, but the data record still exists and still has a version associated with it; shouldn't the client be given the new version?<br /><br />In the end I have a new appreciation for why there are so many interpretations of the HTTP spec, and I have a fairly general-purpose HTTP caching servlet filter on top of it.<br />Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-34269509141725198402009-06-17T16:45:00.000-07:002009-06-30T12:20:40.370-07:00Resource Oriented FrameworkWhat happens when you put an Object-Oriented Rules Engine in a Resource Oriented Framework? After three years of research, I think I have found the answer and have released it as <a href="http://www.openrdf.org/">AliBaba</a>.<br /><br />AliBaba is separated into three primary modules. The Object Repository provides the Object-Oriented Rules Engine. It is based on the Elmo codebase that has been in active development for the past four years. The Metadata Server is a Resource Oriented Framework built around the Object Repository. Finally, the Federation SAIL gives AliBaba more scalability.<br /><br />In AliBaba, every resource is identified by a URL and can be manipulated through common REST operations (GET, PUT, DELETE). Each resource also has one or more types that enable it to take on Object-Oriented features that can be defined in Java or OWL. Each object's properties and methods can be exposed with annotations as HTTP methods or operations. Operations are used with GET, PUT and DELETE HTTP methods by suffixing the URL with a '?' and the operation name. These operations are commonly used for object properties, while object methods are commonly exposed as other HTTP methods (POST) or as GET operations. This HTTP transparency allows the Metadata Server's API to hide within the HTTP protocol and not dictate the protocol used, allowing it to implement many existing RESTful protocols.<br /><br />AliBaba provides a unique combination of Object-Oriented Programming and a Rules Engine, available in a Resource Oriented Framework. I believe it combines some of the most promising design paradigms commonly used in Web applications. 
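As an illustration of the operation convention just described, splitting the request URL at the '?' yields the resource URL and the operation name: a GET on "/invoice/42?dueDate" reads the dueDate property of the resource "/invoice/42". The class and field names below are made up for illustration; AliBaba's actual dispatcher is more involved.

```java
// Sketch of AliBaba-style operation addressing (illustrative names only).
class OperationRequest {
    final String resource;  // the URL identifying the object
    final String operation; // null for a plain GET/PUT/DELETE of the resource

    OperationRequest(String requestUri) {
        int q = requestUri.indexOf('?');
        if (q < 0) {
            resource = requestUri;
            operation = null;
        } else {
            resource = requestUri.substring(0, q);
            operation = requestUri.substring(q + 1);
        }
    }
}
```

Because the operation rides in the query string of an ordinary HTTP request, clients that know nothing of AliBaba still see plain RESTful URLs.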
Its potential to minimize software maintenance costs and maximize productivity by combining these paradigms is very exciting.<br /><br />Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-3465716704098863819.post-13905709135354310192009-06-08T08:14:00.000-07:002009-06-30T12:21:23.353-07:00Intersecting MixinsAs software models become increasingly complex, designers seek additional ways to express their domain models in a form that more closely matches their design concepts. One way this is done is through Mixins.<br /><br />A Mixin is a reusable set of class members that can be applied to more than one class. It is similar to inheritance, but does not interrupt the existing hierarchy.<br /><br />Suppose we have a class called "Invoice" within our domain model and we want to "enhance" this class with operations to fax, email, and snail-mail it to the customer. To avoid reducing Invoice's cohesion, we want to define this behaviour in a separate construct. We could subclass Invoice, but the ability to send a document is common among other classes as well. We could put this logic in a super class, but that only works if there is an appropriate common super class among them. An alternative is to create mixins, called Faxable, Emailable, and Mailable, that are added to all the classes that can be sent.<br /><br />Suppose some of our documents require a unique header. 
If this behaviour is common, but no appropriate super class exists, a mixin would be a desirable choice. Mixins allow classes to be extended with new behaviour, but what if you want to alter existing behaviour? Unfortunately, many mixin implementations do not allow calls to the overridden method, and the ones that do require it to be done procedurally (by changing an open class at runtime).<br /><br />When using inheritance, a subclass can call the overridden method to intersect and alter the existing behaviour. A mixin, however, does not inherit the behaviour of its targets, and multiple mixins may want to alter the same behaviour, so there is no single "super member", but one for every mixin that implements it.<br /><br />Most languages allow a mixin to override the target's behaviour, but don't allow it to be intercepted. Some languages, like Ruby and Python, allow the target class to be altered by renaming and replacing members. This allows the programmer to simulate an intersection, but is a much more complex way of handling it.<br /><br />In AliBaba's Object Repository a mixin, also known as a behaviour, can declare precedence among other mixins, which allows them to control how method execution proceeds. For example, if a mixin has the @precedes annotation, it will be executed before any of the given mixin classes. By declaring the method with a @parameterTypes annotation listing the overridden method's parameter types, and with a Message as the method parameter, the mixin can call msg.proceed() to execute other behaviours and retrieve their result. 
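To illustrate, here is a self-contained sketch of the proceed() mechanism. The Message and behaviour interfaces below are simplified stand-ins for AliBaba's actual API (the real signatures differ); the point is only how one behaviour intersects another by calling msg.proceed() and altering its result.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for AliBaba's behaviour API (hypothetical interfaces).
interface Message {
    Object proceed(); // run the remaining behaviours and return their result
}

interface Behaviour {
    Object invoke(Message msg);
}

// A behaviour that alters the overridden result by calling msg.proceed().
class HeaderBehaviour implements Behaviour {
    public Object invoke(Message msg) {
        return "HEADER\n" + msg.proceed(); // prepend a unique header
    }
}

// Stands in for the overridden method itself.
class BodyBehaviour implements Behaviour {
    public Object invoke(Message msg) {
        return "invoice body";
    }
}

// Runs behaviours in precedence order, as @precedes would arrange them.
class BehaviourChain {
    static Object execute(List<Behaviour> ordered) {
        return execute(ordered, 0);
    }
    private static Object execute(List<Behaviour> ordered, int i) {
        return ordered.get(i).invoke(() -> execute(ordered, i + 1));
    }
}
```

Calling BehaviourChain.execute(Arrays.asList(new HeaderBehaviour(), new BodyBehaviour())) runs HeaderBehaviour first, which proceeds into BodyBehaviour and prepends the header to its result.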
This allows mixins to call the overridden methods and provides a way of intersecting other methods.<br /><br />By extending the basic mixin construct to allow mixins to co-exist and interact, they can be used to address other aspect-oriented problems in an OO way.<br /><br />Anonymousnoreply@blogger.com0