Monday, November 2, 2009

Why isn't the Web Object-Oriented?

A big part of the Web is web services, but often these services are not modelled using an object oriented paradigm, even though it is well suited for complex behaviours. Web services are often modelled using a simple request/response paradigm or a service oriented paradigm using a RESTful framework, but many of these resource oriented frameworks can be adapted to support some object oriented concepts.

Many people think of classes and methods when they think of Object-Oriented Programming (OOP). However, I like to think of OOP as message passing with class specialization. This is particularly helpful when designing Web services, which also use a message passing model. Even RESTful Web services use forms of message passing between nodes.

Consider the simple URL below. When followed, a GET request is sent to a Google server. This can be thought of as sending Google's search object a message with the given search term parameter (using the Google network as the authority). The search object (in this case a proxy) responds with an HTML page containing the search results.

      Object Authority
       ______|_______
       /            \
http://www.google.com/search?q=Why+isn%27t+the+Web+Object-Oriented%3F
\__________________________/\________________________________________/
             |                                  |
      Object Identity                        message


All HTTP requests can be thought of as messages being sent to remote objects. The request method, query parameters, headers, and body make up the message, and the request URI identifies the message's target object. The HTTP response is the message's return value.
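
A minimal sketch of this view in plain Java, using nothing but java.net (the Accept header is just an example, not part of the original request):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SearchMessage {
    public static void main(String[] args) throws Exception {
        // The request URI identifies the target object (authority + path).
        String question = URLEncoder.encode("Why isn't the Web Object-Oriented?", "UTF-8");
        URL target = new URL("http://www.google.com/search?q=" + question);

        // The method, query parameters, and headers make up the message.
        HttpURLConnection request = (HttpURLConnection) target.openConnection();
        request.setRequestMethod("GET");
        request.setRequestProperty("Accept", "text/html");

        // The HTTP response is the message's return value.
        System.out.println("Status: " + request.getResponseCode());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(request.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null;) {
            System.out.println(line);
        }
        in.close();
    }
}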

However, OOP is more than simply message passing. A big part of OOP is the association of behaviour with data. The relationship between behaviour and data gets at the difference between the service oriented and object oriented paradigms. A service oriented model is like an object oriented model, but one in which all objects are stateless singletons with their own unique behaviour. Because of this, pure service oriented systems can be more efficient (less data access), but they are more expensive to maintain, as each service must consider all possible variations at once. In contrast, OOP supports behaviour specialization and can more closely reflect the structure of systems 'in the real world'.

While many services are identified by a single request URI (scheme+authority+path), most RESTful frameworks allow data to also be associated with the URI. JAX-RS, for example, allows path parameters that are often populated with a unique entity ID. By incorporating the entity ID in the URI, data is associated with the behaviour in the same way as in an OOP paradigm. However, most RESTful frameworks fail to provide any support for object or resource behaviour specialization -- a feature that is incredibly powerful in class-based OOP.
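
A minimal JAX-RS sketch of this idea (the resource class and paths are made up for illustration):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

// Hypothetical resource: the {id} path parameter ties a data record to the
// behaviour defined here, much like "this" in an object oriented method call.
@Path("/orders/{id}")
public class OrderResource {

    // GET /orders/42 reads as sending the order identified by the URI a
    // "represent yourself as HTML" message.
    @GET
    @Produces("text/html")
    public String view(@PathParam("id") String id) {
        return "<html><body>Order " + id + "</body></html>";
    }
}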

The Web is actually fairly close to seamlessly supporting an object-oriented paradigm. Processing efficiency seems to be the only barrier. However, with the growing costs of maintaining complex Web systems, I'm not sure how long this argument can hold up. When do you think we'll have an object oriented Web framework and what would it look like?


Monday, October 26, 2009

The Complicated Software Stack

To aspiring Web application developers, or anyone looking to put together their own Web application: the road to building a modern, working Web application is long and complicated.

Today's Web application developer is nothing short of a jack-of-all-trades, requiring deep knowledge of everything from HTML and CSS to Java and SQL. Everything from common CRUD tasks to sophisticated workflows requires knowledge of half a dozen computer languages along with their quirks and variations across platforms and applications.

Today's software is built using a mix of programming paradigms and data models. Every level in the software stack requires explicit data mapping between paradigms. Many Web applications include the following levels in their software stack:
• Relational for persistence,
• Object oriented (class-based) in the model,
• Aspects peppered throughout,
• Resource (or activity) oriented Web services,
• Functional template engines,
• Markup using key/value pairs, and
• Prototype based objects for UI behaviour.

The above complication comes at a price. Software takes longer to develop and is more expensive to maintain than it used to be. This is causing a greater divide between small tools and large software systems.

Applications like Microsoft Excel, which combine data processing and persistence using a consistent programming paradigm, have grown in popularity as cheap alternatives to the complexity of modern Web applications.

While the market for Web applications has grown, the scope has decreased, favouring large high volume systems. Smaller Web applications are too often over-architected and over-budget. There is a large (and growing) opportunity for software vendors to fill this divide and create a new platform that combines data processing and persistence, using a single programming paradigm, for Web applications.

Can Web applications be built to use a single programming paradigm?


Tuesday, September 29, 2009

Chrome Frame: Love It Or Hate It


Google has clearly struck a nerve among browser makers with the announcement of Chrome Frame. Microsoft was awfully quick to downplay any thought of installing Chrome as a plugin for IE, considering it refers to WebKit's market share as a "rounding error". Mozilla has also recently become vocal about putting down any notion of a browser-in-a-browser solution. This is all quite bizarre, as both of these players are big into browser plugins of one form or another: Microsoft with its alternative Silverlight application engine, and Mozilla, which acquired its market share through extensible plugins of its own.

It is actually quite common to have multiple rendering engines within the same browser: Flash, Silverlight, and Java being the most obvious, but there are more. IE has had a number of browser plugins in the past, including the Mozilla ActiveX Control and Google's SVG plugin. IE8 ships with multiple rendering engines that get triggered based on HTML tags or user actions. Netscape 7, although short lived, shipped with both the Gecko and IE rendering engines. Mozilla has encouraged this type of approach in the past, with Google's ExCanvas and Mozilla's now-inactive Screaming Monkey initiative. Today Mozilla still makes IE available as a Firefox plugin.

I think it is ridiculous to ask users to only use particular browsers for particular websites. Choosing the best available rendering engine should be the choice of the website's authors, and I would welcome a mega-browser that seamlessly switches between Gecko, Trident, WebKit, and Presto based on the preferred engine of the author. More precisely, I trust website authors to choose standards-compliant engines more than I trust users to choose standards-compliant browsers.

I find Mozilla's reaction particularly interesting as it comes at a time when I find myself, an old Gecko fan, looking at WebKit more seriously. Recently in a project, due to an old outstanding Gecko issue, I had to put Firefox support on hold while Trident, Presto and WebKit continued to operate without much trouble.

I know it is true of IE, and perhaps it is true of Mozilla as well, that the engine is viewed as just something a browser needs and not a feature in and of itself. Perhaps I have been wrong all along and XUL is actually Mozilla's doom.


Sunday, September 20, 2009

Accept Headers: In The Wild

As web agents (including browsers) become more diverse there is an increasing need to distinguish between their types. The User-Agent header can be used for this task, but requires the server to know in advance all the possible agents and what type they are. This is not possible as both the diversity and quantity of agents is growing too quickly for any single registry to track.

According to the HTTP specification, the Accept header can be used to determine the type of agent. For example:
• HTML browsers should include "text/html" within the Accept header,
• XHTML browsers include "application/xhtml+xml",
• RDF browsers include "application/rdf+xml",
• XSLT agents include "application/xml",
• PDF agents include "application/pdf",
• Office suites include "application/x-ms-application" or "application/vnd.oasis.opendocument", and
• JavaScript libraries include "application/json"

This allows the server to better redirect the agent to an appropriate resource.

Obviously, if a service will only serve HTML browsers, the type of agent is not necessary, as was the case in the Web 1.0 days when everything on the Web was HTML. However, as HTTP becomes a more popular protocol for non-HTML communication, the need to distinguish between types of agents is becoming important.

Consider the situation when an abstract information resource (like an order or an account) is identified by a URL. When the server receives a request for an abstract information resource, it needs to know which type of agent is requesting it, so it can better redirect the agent to an appropriate representation. If the agent is an HTML browser, the server should redirect to an HTML page displaying the order or account information; if a JavaScript library, the server should redirect to a JSON dump of the order/account summary; if a PDF agent, the server should redirect to an order/account summary report; if an office suite, the server should redirect to a spreadsheet of the details.
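
A rough servlet sketch of that redirection (the target URLs and file extensions are made up; a real service would map them to actual representations):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Redirect an abstract resource (e.g. /orders/42) to a concrete representation
// based on the type of agent implied by the Accept header.
public class AbstractResourceServlet extends HttpServlet {

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String accept = req.getHeader("Accept");
        String uri = req.getRequestURI();
        String location;
        if (accept == null || accept.contains("text/html")) {
            location = uri + ".html";                   // HTML browser
        } else if (accept.contains("application/json")) {
            location = uri + ".json";                   // JavaScript library
        } else if (accept.contains("application/pdf")) {
            location = uri + ".pdf";                    // PDF agent
        } else if (accept.contains("application/vnd.oasis.opendocument")
                || accept.contains("application/x-ms-application")) {
            location = uri + ".ods";                    // office suite
        } else {
            location = uri + ".html";                   // fall back to HTML
        }
        resp.setStatus(303);                            // 303 See Other
        resp.setHeader("Location", location);
    }
}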

This works very well in theory, but because the Web was built with only HTML browsers in mind, most browsers don't properly implement the HTTP specification (because they don't have to). Even worse is that most non-HTML browser agents either don't include an Accept Header at all or use */* and say nothing about the type of agent. Below are some of the default accept headers from popular user agents on the web.

FF3.5 is an HTML and XHTML browser first, XML/XSLT agent second
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

IE8 is a media viewer (apparently)
image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, */*

IE8+office is a media viewer and office suite
image/gif, image/jpeg, image/pjpeg, application/x-ms-application,
application/vnd.ms-xpsdocument, application/xaml+xml,
application/x-ms-xbap, application/x-shockwave-flash,
application/x-silverlight-2-b2, application/x-silverlight,
application/vnd.ms-excel, application/vnd.ms-powerpoint,
application/msword, */*

Chrome3 is an XHTML and XML/XSLT agent first, HTML browser second, and text viewer third.
application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Safari3 is an XHTML and XML/XSLT agent first, HTML browser second, and text viewer third.
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Opera10 is an HTML and XHTML browser first, XML/XSLT agent second.
text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1

The MSN bot is an HTML browser, text viewer, xml client and application archiver.
text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf

Google search bot is a jack of all agents, master of none
*/*

Yahoo search bot is a jack of all agents, master of none
*/*

AppleSyndication is a jack of all agents, master of none
*/*

See Also:
Unacceptable Browser HTTP Accept Headers (Yes, You Safari and Internet Explorer)
WebKit Team Admits Error, Downplays Importance, Re: 'Unacceptable Browser HTTP Accept Headers'


Monday, August 24, 2009

Dereferencable Identifiers

A document URL is a dereferencable document identifier. We use URLs all over the Web to identify HTML pages and other web resources. When you can't give out a brochure you can share a URL. Instead of sending a large email attachment, you might just send a URL instead. Rather than creating long appendices, you can simply link to other resources. It is so much more useful to pass around URLs than to transfer entire documents around.

This model has worked well for documents and is now being adopted for other types of resources. With the popularity of XML, using URLs to identify data resources is now commonplace. Rather than passing around a complete record, agents pass around an identifier that can be used to look up the record later. By using a URL as the identifier these agents don't need to be tied to any single dataset and are much more reusable.

Out of the HTML5 standardization process has arisen a debate on the usefulness of URLs as model identifiers. Most people agree that a URL is a good way to identify documents, web resources and data resources. However, the debate continues on the usefulness of using a URL as an identifier within a model vocabulary. One side claims that a model vocabulary should be centralized and therefore does not require the flexibility of a URL. The other side claims the model vocabulary should be extensible and requires the universal identifying scheme that URLs provide.

To understand the potential usefulness of using a URL as a model identifier, consider the behavioural difference between a missing DTD and a missing Java class. A DTD is identified using a URL and a Java class is not. When an XML validator encounters a DTD it does not understand, it dereferences the identifier and uses the resulting model to process the XML document. When a JVM encounters a Java class it does not understand, it throws an exception, often terminating the entire process. Now consider how much easier it would be to program if a programming environment used URLs for classes and model versions. Dependency management would become as simple as managing import statements. As the Web becomes the preferred programming environment of the future, we must consider these basic programming concerns.
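
As a thought experiment, Java already has the pieces to dereference a class identifier instead of failing (the jar URL and class name below are made up):

import java.net.URL;
import java.net.URLClassLoader;

// Sketch: rather than dying with NoClassDefFoundError, resolve an unknown
// class by dereferencing a URL. The location and class name are hypothetical.
public class DereferencingLoader {
    public static void main(String[] args) throws Exception {
        URL library = new URL("http://example.org/models/invoice-1.0.jar");
        ClassLoader loader = new URLClassLoader(new URL[] { library },
                DereferencingLoader.class.getClassLoader());

        // Dependency management reduced to managing identifiers.
        Class<?> invoiceClass = loader.loadClass("org.example.model.Invoice");
        System.out.println("Loaded " + invoiceClass.getName());
    }
}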

Although I enjoy working in abstractions, I certainly understand how things always get more complicated when you go meta: using URLs to describe other URLs. However, this complexity is essential to maintaining the flexibility and extensibility of the Web.

See Also: HTML5/RDFa Arguments



Sunday, August 23, 2009

97 Things Every Project Manager Should Know

If the projects you manage don't go as smoothly as you'd like, 97 Things Every Project Manager Should Know offers knowledge that's priceless, gained through years of trial and error. This illuminating book contains 97 short and extremely practical tips -- whether you're dealing with software or non-IT projects -- from some of the world's most experienced project managers and software developers. You'll learn how they've dealt with everything from managing teams to handling project stakeholders to runaway meetings and more.

This is O'Reilly's second book in its 97 Things series. My contributions included tips to Provide Regular Time to Focus and Work in Cycles.




Friday, July 31, 2009

SPARQL Federation and Quints

There are currently a couple of popular ways to federate SPARQL endpoints together:

1) In Jena the service must be explicitly part of the query, and therefore the model,

2) In Sesame the basic query patterns must be associated with one or more endpoints before evaluating the query, or

3) Hack the remote query into a graph URI: http://gearon.blogspot.com/2009/05/federated-queries-long-time-ago-tks.html

Although both can be used to achieve the same results, Jena's solution puts more responsibility in the data model, and Sesame's puts more responsibility in the deployment. Both have their trade-offs, but I believe the query is supposed to be abstracted away from the underlying services. The domain model (and therefore the queries) should not be aware of how the data is distributed (or stored) across a network. Therefore, I prefer to describe which graph patterns and relationships are available at each endpoint during deployment and make the application model independent of the available service endpoints.
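
To make the first option concrete, here is a rough sketch using ARQ's SERVICE extension (the endpoint URL is made up); notice how the endpoint leaks into the query, and therefore into the model:

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Jena-style federation: the endpoint appears inside the query itself,
// so the domain model is aware of how the data is distributed.
public class ServiceQuery {
    public static void main(String[] args) {
        String query =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n" +
            "SELECT ?name WHERE {\n" +
            "  SERVICE <http://remote.example.org/sparql> {\n" + // hypothetical endpoint
            "    ?person foaf:name ?name .\n" +
            "  }\n" +
            "}";
        QueryExecution exec = QueryExecutionFactory.create(query,
                ModelFactory.createDefaultModel());
        ResultSet results = exec.execSelect();
        while (results.hasNext()) {
            System.out.println(results.next().get("name"));
        }
        exec.close();
    }
}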

Furthermore, I think it is a bit silly to add yet another level of complexity to the basic query pattern. Adding the service level turns the basic query pattern from a quad to a quint.

To fully index a quint (with support for a service variable, which Jena does not support) would take 13 indexes (nearly double what a quad requires). Below is a table of some complexity levels and how many indexes they require to be fully indexed (variables could appear in any position within the pattern). I have included a theoretical sext that would allow you to group services in a network (just as graphs can be grouped in a service).
Level    # of Idx   Term        Data Structure
double   2          subject     directed graph
triple   3          predicate   labelled directed graph
quad     7          graph       multiple labelled directed graphs
quint    13         service     replicated multiple labelled directed graphs
sext     25         network     trusted replicated multiple labelled directed graphs

Switching from triples to quads provides a big functionality leap (the ability to refer to an entire graph as a single resource). However, I question how much functionality a quint (or a sext) adds over a quad. Couldn't the same functionality be put into a property of the graph (or embedded in the graph's URI authority)? An inferencing engine/query could also conclude graph relationships like subGraphOf, which would still allow a large, but precise, collection of graphs to be queried more effectively.

Hopefully, this topic will have more time to mature before the SPARQL working group makes any official decisions on the matter.

Wednesday, July 22, 2009

Enterprise Information Systems and Web Technologies

I recently got back from speaking at the Enterprise Information Systems and Web Technologies conference in Orlando, Florida, where I presented my paper on an Object-Oriented rules engine. In the talk I shared examples of when businesses need to coordinate, track data, and check policy between organizations, such as in transportation, satellite data tracking, and contract management. I outlined the following requirements and went into detail on the various components of the system.

Requirements
• Reduce the investment costs and time
• Policies understandable by domain experts
• Rules must not inadvertently interfere with one another
• Model complex domains
• Easily adapted to change
• Policy rules have access to external services
• Track all state changes, both their cause and effect

The talk was well received and prompted some interesting discussions.

Wednesday, July 8, 2009

Panel: Linked Open Data

The SemTech 2009 Videos have been posted, including the Linked Open Data Panel.

The "data commons" is a cornerstone of the semantic web vision. The Linked and Open Data movements are progressing beyond the early adopter phase and preparing to cross the chasm. Enough experience now exists to reflect on how this data set is being used, how useful it is, and where we can take it from here. Beyond the basics, the panel will discuss issues such as quality of service, stability, and longevity. They'll also explore the evolution of the semantic web with a particular emphasis on modes of data use, reuse and aggregation.

Paul Miller, The Cloud of Data
Jamie Taylor, Metaweb Technologies, Inc.
Leigh Dodds, Talis
James Leigh, James Leigh Services, Inc.
Kingsley Idehen, OpenLink Software, Inc.

http://www.semanticuniverse.com/semtech-panel-linked-open-data.html

Tuesday, June 30, 2009

HTTP Servlet Caching Filter

The nice thing about the HTTP protocol is how easy it is to implement a trivial HTTP server. At some point, however, just responding to HTTP requests is not enough and response caching must be introduced.

If you search for "servlet response caching", you will find advice to ensure you use the correct response headers to facilitate HTTP caching and suggestions to use a servlet filter to cache the response. If you are like me, you would continue to search for a way to use both - a servlet filter that caches based on the correct response headers.

With the HTTP protocol so well supported and J2EE so popular, finding a caching servlet filter that adheres to the HTTP spec should be easy, but it isn't. In fact, it is really hard to find any Java implementation that caches based on the HTTP response headers (servlet filter or otherwise). This seemed like an interesting and fairly common problem, so I spent some time to see how far I could get with a servlet filter that understands HTTP caching.

Despite my general knowledge of the HTTP spec, implementing it proved a lot more difficult. For example, the If-Modified-Since and If-None-Match headers are fairly easy to understand, but when you try to implement this logic, things get a little more complicated. In working through this, I realized that there are nearly 20 possible scenarios that need to be handled by the server. For If-Modified-Since alone there are three states the server must handle: the request might not have the header, the resource might have been modified, or it might not have been modified. The If-None-Match header may not be present, it might or might not match, and it might carry a '*' tag, which may or may not refer to an existing entity. You can't just process these one at a time either, but once you write down the edge cases it can all be handled fairly compactly within a precondition check.
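
A compressed sketch of that precondition check (not the actual filter code, and it glosses over weak validators and multiple entity tags):

import javax.servlet.http.HttpServletRequest;

// Given a cached entry's ETag and last-modified date, decide whether
// 304 Not Modified applies. This only covers the common cases.
public class Preconditions {

    public static boolean notModified(HttpServletRequest req,
            String etag, long lastModified) {
        String ifNoneMatch = req.getHeader("If-None-Match");
        long ifModifiedSince = req.getDateHeader("If-Modified-Since"); // -1 if absent

        if (ifNoneMatch != null) {
            // "*" matches any existing entity; otherwise compare entity tags.
            boolean match = ifNoneMatch.equals("*") || ifNoneMatch.contains(etag);
            // When both headers are present, both must indicate "not modified".
            if (ifModifiedSince != -1 && lastModified > ifModifiedSince) {
                return false;
            }
            return match;
        }
        if (ifModifiedSince != -1) {
            return lastModified <= ifModifiedSince;
        }
        return false; // no conditional headers: serve the full response
    }
}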

Another area that surprised me was the request Cache-Control directives. There are five boolean directives and three with a specified value. All of these directives are fairly easy to understand and are used to determine if the cache can be used and if it needs to be validated. However, that is a lot of variables to manage and, combined with the possible server directives, it gets really hairy tracking their state. There were many occasions when, while adding support for a new header/directive, I inadvertently broke an earlier unit test (couldn't have done it without them).

The HTTP spec is fairly clear in some areas, but less so in others. One area that has had various interpretations is entity tags. It is fairly clear how ETags should be used with static GET requests, although I had to digest their implications for caches before I could understand how to use them with content negotiation. However, their recommended use with PUT and DELETE is still a bit of a mystery. When an entity has no fixed serialized format (such as a data record), it has many entity tags (one for each serialized variation and version). So, which entity tag should be used after a PUT or a DELETE that affects all variations?

This gets even more complicated when some URLs are used to represent a property of a data record. If the property is a foreign key, the response has no serializable format; it's a 303 See Other response. What does a PUT look like when you want to reference another resource? Furthermore, a DELETE of a property just deletes the property, but the data record still exists and there is still a version associated with it; shouldn't the client be given the new version?

In the end I have a new appreciation for why there are so many interpretations of the HTTP spec, and a fairly general purpose HTTP caching servlet filter to show for it.

Wednesday, June 17, 2009

Resource Oriented Framework

What happens when you put an Object oriented Rules Engine in a Resource Oriented Framework? After three years of research, I think I have found the answer and have released it as AliBaba.

AliBaba is separated into three primary modules. The Object Repository provides the Object Oriented Rules Engine. It is based on the Elmo codebase that has been in active development for the past four years. The Metadata Server is a Resource Oriented Framework built around the Object Repository. Finally, the Federation SAIL gives AliBaba more scalability.

In AliBaba, every resource is identified by a URL and can be manipulated through common REST operations (GET, PUT, DELETE). Each resource also has one or more types that enable it to take on Object Oriented features that can be defined in Java or OWL. Each object's properties and methods can be exposed with annotations as HTTP methods or operations. Operations are used with GET, PUT and DELETE HTTP methods by suffixing the URL with a '?' and the operation name. These operations are commonly used for object properties, while object methods are commonly exposed as other HTTP methods (POST) or as GET operations. This HTTP transparency allows the Metadata Server's API to hide within the HTTP protocol and not dictate the protocol used - allowing it to implement many existing RESTful protocols.
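
As a sketch of what that looks like from the Java side (the @operation annotation here is declared locally for illustration only; AliBaba's real annotations live in its own packages and may be named differently):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in annotation declared here only to illustrate the idea;
// it is not AliBaba's actual annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface operation {
    String value();
}

// A property exposed as an operation would be addressed as
// GET http://example.org/invoices/42?balance
public class Invoice {

    @operation("balance")
    public double getBalance() {
        return 0.0; // placeholder; the real value would come from the store
    }
}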

AliBaba provides a unique combination of Object Oriented Programming and a Rules Engine, available in a Resource oriented Framework. I believe it combines some of the most promising design paradigms commonly used in Web applications. Its potential to minimize software maintenance costs and maximize productivity by combining these paradigms is very exciting.


Monday, June 8, 2009

Intersecting Mixins

As software models become increasingly complex, designers seek additional ways to express their domain models in a form that more closely matches their design concepts. One way this is done is through Mixins.

A Mixin is a reusable set of class members that can be applied to more than one class. It is similar to inheritance, but does not interrupt the existing hierarchy.

Suppose we have a class called "Invoice" within our domain model and we want to "enhance" this class with operations to fax, email, and snail-mail it to the customer. To prevent the reduction of Invoice's cohesion, we want to define this behaviour in a separate construct. We could subclass Invoice, but the ability to send a document is common among other classes as well. We could put this logic in a super class, but that only works if there is an appropriate common super class among them. An alternative is to create mixins, called Faxable, Emailable, and Mailable, that are added to all the classes that can be sent.
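
Java has no first-class mixins, but a rough sketch of the idea using interfaces with default methods (a Java 8 feature, so only an approximation) looks like this:

// Rough sketch of the Faxable/Emailable mixins using interfaces with default
// methods; real mixin support differs by language.
interface Faxable {
    String content();

    default void fax(String number) {
        System.out.println("Faxing to " + number + ": " + content());
    }
}

interface Emailable {
    String content();

    default void email(String address) {
        System.out.println("Emailing to " + address + ": " + content());
    }
}

// Invoice keeps its own hierarchy and cohesion; sending behaviour is mixed in.
public class Invoice implements Faxable, Emailable {
    public String content() {
        return "Invoice #42";
    }

    public static void main(String[] args) {
        new Invoice().fax("555-0100");
        new Invoice().email("billing@example.com");
    }
}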

Suppose some of our documents require a unique header. If this behaviour is common, but no appropriate super exists, a mixin would be a desirable choice. Mixins allow classes to be extended with new behaviour, but what if you want to alter existing behaviour? Unfortunately, many mixin implementations do not allow calls to the overridden method, and the ones that do require it to be done procedurally (by changing an open class at runtime).

When using inheritance, a subclass can call the overridden method to intersect and alter the existing behaviour. A mixin, however, does not inherit the behaviour of its targets, and there are possibly multiple mixins wanting to alter the same behaviour, so there is no single "super member", but one for every mixin that implements it.

Most languages allow a mixin to override the target's behaviour, but don't allow it to be intercepted. Some languages, like Ruby and Python, allow the target class to be altered by renaming and replacing members. This allows the programmer to simulate an intersection, but is a much more complex way of handling it.

In AliBaba's Object Repository a mixin, also known as a behaviour, can declare precedence among other mixins, and mixins are allowed to control how method execution proceeds. For example, if a mixin has the annotation @precedes, it will be executed before any of the given mixin classes are executed. By declaring the method with a @parameterTypes annotation, listing the overridden method's parameter types, and a Message as the method parameter, the mixin can call msg.proceed() to execute the other behaviours and retrieve their result. This allows mixins to call the overridden methods and provides a way to intersect other methods.
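
A sketch of the shape this takes (the annotation and Message types below are stand-ins declared for illustration; AliBaba's own declarations differ in package and signature):

// Hypothetical stand-ins for the annotation and Message types described above.
@interface precedes { Class<?>[] value(); }
@interface parameterTypes { Class<?>[] value(); }

interface Message {
    Object proceed(); // execute the remaining behaviours / overridden method
}

class TaxBehaviour { /* another behaviour that also implements getTotal() */ }

// A behaviour (mixin) that runs before TaxBehaviour, lets the other
// implementations execute, and then alters their result.
@precedes(TaxBehaviour.class)
public class DiscountBehaviour {

    @parameterTypes({})
    public Object getTotal(Message msg) {
        Double total = (Double) msg.proceed(); // call the overridden behaviour(s)
        return total * 0.9;                    // apply a 10% discount
    }
}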

By extending the basic mixin construct to allow them to co-exist and interact, a mixin can be used to address other aspect oriented problems in an OO way.


Thursday, June 4, 2009

Time to Rethink MVC on the Web

Typical web applications don't do a very good job of separating the model, view, and controller logic. The view is particularly confusing, as logic is split between server side and client side. This is a side effect of trying to support "dumb" (or nearly dumb) clients (browsers that only support static HTML). Furthermore, logic that would be more appropriate in the model often gets put into the controller to avoid unnecessary access to the database.

While these design decisions lead to faster web application servers, they also lead to what I call coincidence coding - an unwanted code separation that has no documentation or declared interface. The worst part is that this separation often happens in the most visible part of the system: over HTTP and SQL. This prevents the system from becoming a black box utility, because the internals rear their ugly heads. Any attempt to standardize the protocol (or to use well documented services) is impossible because the protocols get stuck between various components of the view or model and cannot be separated cleanly and maintained across versions.

It is not all doom and gloom: the web is a different place than it was when many of these web frameworks were designed. I am pleased to report that all modern desktop web browsers have good support for client side templating using XSLT (finally!). This means web applications today don't need to support these dumb clients. Furthermore, web proxies have also improved enough that it is now common for production deployments to plan on using a proxy server in front of the web server(s). Accessing the persisted data is not as much of a problem as it used to be, as database caching is usually planned when systems are being designed.

With these (somewhat) recent developments, it's time for a new breed of web frameworks to emerge. These frameworks could implement MVC on the web like we have never seen before - really separating the model from the view and from the controller. This could have a significant impact on the maintainability of web application software.

Yesterday, I introduced what I believe to be the first web application framework that does a decent job of separating MVC. It is called the AliBaba Metadata Server and can be downloaded from OpenRDF.org.

The view logic can only exist in static HTML, JS, CSS, and XSLT files. The model/data is transformed into standard data formats, like RDF and some forms of JSON, without model-specific transformations. You can't put any model logic into the controller, and all service logic must be put into the model. On its own, it would be slower than most web application servers, but with proxying and caching on both ends, I believe it will perform just as well and, most importantly, yield more maintainable web applications.


Monday, June 1, 2009

Standard RDF Protocol

The SPARQL Working Group's first meeting has come and gone. Part of that meeting discussed the scope of what should be undertaken as part of SPARQL2, including what features should be added to the protocol. SPARQL is a standard query language and protocol.

I am a big fan of standards, as most people are in the Semantic Web community. However, I feel that a standard database protocol does not provide much value and that, in fact, the protocol should be tailored to the evaluation and storage mechanism used by the implementation.

Standards are intended to be interoperable between implementations, and SPARQL must abstract away from the storage mechanism used in order to achieve the desired level of interoperability. This leads SPARQL to result in a less than efficient query/retrieval operation, when compared to storage specific mechanisms like SQL. However, this still has significant advantages, as a single query can be used with a wide array of storage mechanisms and remains unchanged between significant storage alterations. This does come at a cost to both the ability to create efficient queries and the ability to evaluate queries effectively using the SPARQL protocol.

The cost of query parsing and optimization can always be improved and can be tailored to specific models, without any loss of interoperability at the query language level. However, a standard protocol cannot be optimized the same way.

When a database backed application has performance problems, 9 out of 10 times it is due to excessive communication between the database and its clients. Therefore, I question the value in abstracting this communication protocol away from the storage/evaluation mechanism, because this will exacerbate communication overhead.

While I think the SPARQL query language is developing nicely and should be considered in any project that wants to ease the maintainability of their queries, I also think any project that is concerned about the performance of their queries should consider using a proprietary protocol to optimize its communication with a database.


Monday, May 25, 2009

Speaking at SemTech


I will be speaking at the Semantic Technology Conference in San Jose again this year from June 14-18, 2009.

"Modelling Objects in RDF" talk will be given for attendees that want to learn how to get started quickly building applications that use an RDF store.

Paul Gearon and I will compare and contrast the features and structure of modern RDF stores to help you find the store that works best for your environment and data.

I will be discussing "Linked Open Data" in a panel to give you a chance to ask questions and get answers about how to share and use interconnected data across the web.

Thomas Tague and I will demonstrate how you can use OpenCalais (at no cost) to extract metadata and named entities into the Linked Data cloud from English text.

As a speaker, I am authorized to share registration discounts of up to $200. If you have not registered yet and are interested in attending, please contact me.


Thursday, May 21, 2009

Toronto College of Technology TechTalk

I will be giving my "Modelling Objects in RDF" talk again at TCT next week on May 30th. If you missed the last JUG meeting and are interested in learning/talking about RDF, come by for the talk.

More details can be found here:
http://www.torontocollege.com/Display.action

A Map can be found here:
http://www.torontocollege.com/BI/location.jsp

Annotation Properties

In Java, annotation properties are grouped in annotation classes, which are grouped in packages. Since Java does not consider properties a top level construct (they must be in a class), this organization makes sense for Java. However, burying the property this deep can often make the syntax hard to read.

Compare the annotation syntax of JAX-RS with Spring's annotation-based controller configuration. Spring makes full use of the annotation class/property grouping and can often look confusing, as many annotation properties are crammed into the same declaration. JAX-RS, in contrast, tries to collapse annotation classes and properties together, allowing the annotation declaration to be simplified.

Java allows the annotation property name to be omitted if the property is named "value". This allows the coder to simply state the annotation class and the property value (omitting the annotation property name).

I argue that when using the short syntax, the annotation class (conceptually) becomes the annotation property and therefore should start with a lower case letter. This is how JavaDoc annotations have always been done (they start with a lower case letter) and I think it should be carried over to Java annotations as well.
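
A small example of both points (the annotation name is made up):

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// A lower case annotation class, as argued above; the name is made up.
@Retention(RetentionPolicy.RUNTIME)
@interface path {
    String value(); // naming the property "value" enables the short syntax
}

public class Example {

    // Short syntax: the annotation class reads as if it were the property.
    @path("/orders/{id}")
    public String getOrder(String id) {
        return "order " + id;
    }

    // The equivalent long syntax would be @path(value = "/orders/{id}").
}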

Conceptually, annotation properties are akin to static final class properties and therefore should be easily distinguished from traditional classes. Creating lower case annotation class files is one easy way to do this.

What do you think? Is it okay to create a Java file that starts with a lower case letter? If so, under what conditions?


Thursday, May 14, 2009

Is Google the database of the Web?

Yesterday Google announced plans to improve its users' searching experience by providing more ways to search and find information on the Web. With Google's new search options and rich snippets, it is putting itself in a position to not simply link to desired pages, but to actually provide the information directly.

By adopting RDFa, Google hopes to make it possible to start importing the deeper web into its cluster. It is starting by asking publishers of review, people, product, and organization sites to share the raw data directly, so it can serve the information without requiring its users to visit the publisher's site. Although Google is not the first search engine to use RDFa in its results, it is the first to start using it across all sites.

I am happy to finally see Google start to include rich semantics in its interface, but I worry about what this might mean for publishers. Will Google become the database of the Web? Will the Web surfers of tomorrow never leave Google's domain? Time will tell, but let's hope this triggers new discussions on how we "browse" the growing web of data.


Wednesday, May 6, 2009

Toronto JUG Meeting

My presentation at yesterday's Toronto JUG meeting covered a lot of ground, as it both introduced RDF and demonstrated how to use RDF with objects. Many of the attendees struggled with the significant mind-shift from relational to RDF and with how a data model can be represented as a graph of nodes.

One important point, that I want to reiterate, is that RDF stores are not designed as general purpose databases (as relational databases are), but instead RDF stores are designed for complex data structures. Currently, RDF is mostly used within industries that have complicated data models. However, RDF is becoming more appealing to a wider audience as data models in general are becoming more complicated and interconnected.

If you attended the meeting and/or are interested in talking more about RDF, read through the getting started guide for Sesame[1] and join the discussions on IRC at irc://irc.freenode.net/sesame

[1] http://wiki.aduna-software.org/confluence/display/SESDOC/GettingStarted


Thursday, April 30, 2009

SPARQL Highlighting in GEdit

In searching for a highlighting mode for N3 in gedit, the best I could find was a blog post from a couple of years ago[1]. When writing N3 I am overly cautious about typos, and I could use some indication of syntax errors. So I created a SPARQL syntax highlighting mode for gtksourceview. This will highlight all known xsd, rdf, rdfs, and owl classes, datatypes, and predicates to make creating ontologies in N3 easier.



The file can be downloaded[2] and saved as ~/.gnome2/gtksourceview-1.0/language-specs/sparql.lang. Then the SPARQL mode is available from the View->Highlight Mode->Sources menu.


[1] http://www.semikolon.co.uk/blog/index.php?entry=entry070510-102401
[2] http://bugzilla.gnome.org/attachment.cgi?id=134935&action=view


Tuesday, April 28, 2009

Speaking at Toronto's JUG

I will be giving a talk entitled "Modelling Objects in RDF" next week at Toronto's Java User Group. If you are in the city, please come by and introduce yourself.

Today's most used database model is the relational model. However, in today's high performance web-centric world, the relational database has begun to show its age. This talk introduces the Resource Description Framework (RDF) as an alternative database model. RDF is a family of standards to model properties and relationships between resources in a web-centric way. Unlike the relational model, which stores strictly defined records, an RDF model stores semi-structured graphs. This allows RDF stores to model more complex relationships, scale better across the web, and tolerate schema variations for compatibility.

Object-oriented design is the preferred paradigm for modelling complex software; however, most RDF APIs are tuple-oriented and lack some of the fundamental concepts in object-oriented programming. This talk introduces Sesame and how it can be used with RDF resources using object-oriented designs.

Event details can be found here:

http://torontojug.org/

Thursday, April 23, 2009

AtomPub as a Discovery Protocol

AtomPub was designed as a blog API, but with all the advancements browsers have made as an application platform, it hasn't achieved widespread use among bloggers. However, that doesn't limit its usefulness as a general publishing protocol.

My colleagues at Zepheira and I have been putting AtomPub into action recently. In our latest project we used AtomPub for a marketplace service registry. This allows others to publish the existence of their web services into collections. Because of AtomPub's thorough documentation, it was easy to get everybody on board. With a simple example everybody started seeing the advantages of using a standard protocol. Generally speaking, the AtomPub server was a breeze to maintain, as all operations could be done using curl.

In the same project we also used OpenSearch with Atom to integrate multiple (private) search engines into a unified result. With some sugar and spice added to the "self" links we created a very impressive search solution.

Although AtomPub hasn't achieved much popularity within the blogging world, there are lots of benefits in using a standard protocol and AtomPub has proven itself as a great discovery protocol to us.


Tuesday, March 31, 2009

Long Running Transactions

I have been experimenting recently with long running transactions and am intrigued by their potential. The term 'long running transaction' is loaded with interpretations, so let me explain. Normally, when one thinks of a long running transaction, they think of a set of operations that can be undone, the implications of which are domain/implementation specific.

However, I am experimenting with atomic optimistic long running transactions - like a database transaction that spans multiple HTTP requests. The key requirement here is that it is optimistic. This means there are no resource locks used across the requests, allowing concurrent access and non-conflicting modification.

What makes this interesting is using a web interface to manipulate a complex structure. While other approaches would have the structure copied into session memory or track undo operations, this approach utilizes the indexes already present in the store (RDF in this case), and there is no need to track or even develop complicated undo behaviour. We are all familiar with the classic cancel/apply or restore/save buttons of local applications that span an application or a set of dialogs. With optimistic long running transactions, this same experience can be used in a web application.


Thursday, March 19, 2009

Eight Isolation Levels Every Web Developer Should Know

The ACID properties are one of the cornerstones of database theory. ACID defines four properties that must be present if a database is considered reliable: Atomicity, Consistency, Isolation, and Durability. While all four properties are important, isolation in particular is interpreted with the most flexibility. Most databases provide a number of isolation levels to choose from, and many libraries today add additional layers which create even more fine-grained degrees of isolation. The main reason for this wide range of isolation levels is that relaxing isolation can often result in scalability and performance increases of several orders of magnitude.

Read More Here:
http://www.infoq.com/articles/eight-isolation-levels


Monday, March 9, 2009

RDF Federation

An RDF store (unlike an RDBMS) usually indexes everything. This means that the store's indexes are often larger than the data they contain. Such large indexes make I/O expensive and caching difficult. Scaling vertically by adding more memory for caching and faster disks can make a big difference, but it can get very expensive very quickly.

The alternative is to scale horizontally. This can be done in one of two ways: by mirroring the indexes on other machines, or by partitioning the indexes across other machines. The first option, called clustering, can reduce the I/O load, but will still have difficulty caching it. The second option, called federating, reduces the I/O load on each machine, and can allow each machine to specialize, making caching much more effective.

Federating RDF stores is now going to be a lot easier with Sesame 3.0. Sesame 3.0 will support federating multiple (distributed) Sesame repositories into a unified store. This allows large indexes to be distributed on multiple machines that are connected over a network. The federation supports multiple ways of partitioning the data: it can be partitioned by predicate (property), by subject, or both. When properly set up, the federation can effectively proxy queries to the specialized members and join queries among the distributed members. For large-scale RDF stores, federating is becoming a valuable solution for RDF architecture.

Instructions for setting up a read only Sesame Federation can be found here:
https://wiki.aduna-software.org/confluence/display/SESDOC/Federation


Monday, March 2, 2009

What is the Future of Database Systems?

Last month saw a flurry of activity around the future of relational databases. Although no one can predict the future, I think it is safe to say that many system developers/architects are hungry for mainstream semi-structured databases.

Feb 12 Is the Relational Database Doomed?
Feb 13 Is The Relational Database Doomed?
Feb 13 CouchDB could be a viable alternative to relational databases for storing patient data
Feb 16 The future of RDBMS's
Feb 20 Is the Relational Database Not an Option in Cloud Computing?
Feb 27 How FriendFeed uses MySQL to store schema-less data


Thursday, February 26, 2009

RDF Transaction Isolation

Transaction Isolation in relational databases (for better or worse) is well established. However, the issue of transaction isolation is rarely documented in RDF stores.

The ANSI SQL isolation definitions are UPDATE (write) oriented and do not capture the general use case of RDF, which has no notion of UPDATE. For example, the first ANSI SQL phenomenon, dirty-write, is not even applicable to RDF transactions. Another phenomenon, non-repeatable reads, is defined in terms of records retrieved by a SELECT statement. However, RDF queries (unlike SQL) are pattern based and their results don't have a direct relationship to any internal data record.

Relational database isolation mechanisms do not perform nearly as well when INSERT/DELETE operations are used instead of UPDATE. Furthermore, relational databases often have a lax definition of "serializable", allowing conflicting INSERT operations (assuming that preventing conflicting UPDATE operations is sufficient).

RDF is a different beast altogether. RDF is set oriented. Two RDF transactions adding or removing the same statement do not necessarily conflict with each other, as they would in SQL, because a successful add or remove operation in RDF does not require a state change.

Early RDF use cases required fully serializable transactions, as many of the inferencing rules used in RDF needed to take the complete store state into account. Because of this, RDF stores generally only provide fully serializable transactions. However, fully serializable transactions often do not perform as well as lower isolation levels.

RDF stores are now being used in environments that have a much greater, real-time demand, for fast concurrent write operations. These environments don't require full serialization, but currently lack any other isolation levels to choose from.

To address this need Sesame 3.0 introduces five isolation levels that will allow RDF stores to vary the isolation level provided. By providing different levels, significant performance improvements can be made for lower isolation levels. For example:

• Read Committed isolation level permits weak-consistency and allows proxies to cache repeated results without validation.
• Snapshot isolation level permits eventual-consistency and allows store clusters to maintain independent state and propagate the changes during idle periods.
• Serializable isolation provides a higher degree of isolation, but does not require atomic consistency, permitting concurrent transactions.

For more details on the isolation levels supported by Sesame 3.0 see:
http://wiki.aduna-software.org/confluence/display/SESDOC/TransactionIsolation

What variations of transaction isolation have you used in your application?


Tuesday, February 24, 2009

Sesame 3-alpha1

The first preview of the new Sesame API is now available. Here is an article explaining the new features: http://www.devx.com/semantic/Article/40987


Thursday, February 12, 2009

XHTML: What is it Good For?

With IE8 nearly upon us, discussion on the future of web standards has once again been triggered. Again lacking in IE8 is support for XHTML, ensuring that IE is the only browser that doesn't support it.

The goal of XHTML is to allow XML technologies to be used with HTML. XHTML has been a standard now for about five years, and IE is still (single handedly?) preventing websites from adopting it. Instead websites are forced to use server side solutions that require more bandwidth and processing, while making it harder for non-traditional agents to participate.

Within recent discussions on XHTML many people seem to fail to understand the potential benefit XHTML has over HTML. An interesting example I came across was combining XSLT and XHTML together.

I have written about XSLT before
http://jamesrdf.blogspot.com/2008/12/mvc-with-xslt.html

This example is from 2002 (around the time XSLT and XHTML were standardized) and can be found in more detail here:
http://www.webreference.com/xml/resources/books/practicalxml/chapter5/

As most of us know, XHTML is also an XML file and as such can be used as the input or output of XSLT. What many do not realize (and what is not covered in the above link) is that XHTML can also be used as the stylesheet. XSLT supports a simplified syntax that allows XSLT tags to be embedded inside an XHTML template file, making the template look a lot more like other server-side templating engines.
http://www.w3.org/TR/xslt#result-element-stylesheet

This allows you to use XHTML for the template and the content, and it works in all browsers, except *one* of them. Actually, you can get this to work in IE (even as early as IE5), but you have to use the XML rendering mode.

The XML rendering mode requires that the pages return application/xml and that no doctype be present. Unfortunately, by using the XML rendering mode, no HTML-specific features are available: there are no cookies, no document.write, and script tags are parsed differently.

TV Series.com is rendered in XML mode, for example: http://www.tvseries.com/

If IE would get around to implementing XHTML, I think a lot more websites could safely switch to serving static files and the Web would start to become a lot easier to work with. But that probably wouldn't be good for Silverlight.


Monday, February 2, 2009

Elmo to Get a Face Lift

In 2008 Elmo's interface saw some adjustments to enable more efficient access across HTTP. The challenge with this new interface is understanding how the object operations map to RDF operations and what implications they have on performance over HTTP.

Despite the changes, an EntityManager-oriented interface is still inappropriate for RDF/Object mapping, because relational operations do not directly map to RDF operations. Further, the object persistence abstraction causes more performance problems than it solves for seasoned developers. Unlike object-relational mapping, RDF-object mapping is much more natural.

Work has already begun on Elmo's successor, the AliBaba Object Repository, providing a hybrid RDF/Object interface extension to Sesame. Performance over HTTP is a high priority and was the major inspiration for the new interface.

Early development has already shown significant performance increases for both read heavy and write heavy transactions. For those interested in getting involved, you can checkout the code at
http://repo.aduna-software.org/svn/org.openrdf/alibaba/trunk/

Thursday, January 29, 2009

Perfection Is An Unrealistic Goal

Linda Rising gave a presentation on a topic that is still fairly misunderstood. She tries to address the question "What is the best way for us to work as individuals?" She talks about not deceiving yourself and understanding that everything is a journey without a destination. It is better to build something that is known to be imperfect now than to wait for clarification, as more knowledge can be gained from a system that doesn't work than from any theoretical discussion. Software development is like a series of experiments, a series of learning cycles. She then uses our regular sleep cycles to identify our optimal work cycles and emphasises the importance of taking breaks throughout the day to improve productivity.

http://www.infoq.com/news/2009/01/Perfection-Is-Unrealistic-Rising

Monday, January 26, 2009

Modelling Objects In RDF

Also at The Semantic Technology Conference this year, I will again be speaking on persisting objects in RDF. This is an introductory session on building object-oriented applications with the Sesame RDF repository. This year we will be going into more detail on the advantages of using RDF as a persistence layer, including achieving forward and backward data compatibility, OWL's representational richness, longer running optimistic transactions, and using graphs for auditing purposes.


Thursday, January 22, 2009

Unifying RDF Stores

In June I will be speaking at The Semantic Technology Conference with Paul Gearon about some of the integration work we did with Sesame and Mulgara. We will also be comparing a handful of existing RDF stores and demonstrating how they can be used interchangeably or as a unified federation.

The RDF storage market has become much more diverse recently, with many providers tailoring to specific environments and data patterns. This talk will cover some considerations to help you identify which RDF store implementation is best for your data and environment. We will discuss common features found in many providers, unique features found in only a few, and demonstrate some of the new features in Mulgara. We will also be demonstrating a unified RDF API that allows RDF stores to be swapped into applications post-development, and showing how a provider independent API enables diverse RDF storage nodes to be federated together, allowing each node to be tailored to the unique shape of the data being stored within.

Hope to see you there!


Thursday, January 15, 2009

Validating RDF

RDFS/OWL is criticized for its weak ability to validate documents, in contrast to XML, which has many mature validation tools.

A common source of confusion in RDF is the rdfs:range/rdfs:domain properties. A property value can always be assumed to have the type of the rdfs:range value. This is very different from XML, which only has rules to validate tags, but cannot conclude anything. Many of the predicates in RDF are used for similar inferencing, but they lack any way to validate or check whether a statement really is true. This is a critical feature for data interchange, which RDF is otherwise well suited for.

To address this limitation, an RDF graph can be sorted and serialized into RDF/XML. With a little organization of statements, such as grouping by subject, and controlled serialization, common XML validation tools can be applied to a more formal RDF/XML document. Our validation was done with relatively small graphs and we restricted the use of BNodes to specific statements to ensure similarly structured data would produce similar XML documents.

Although TriX could also have been used (it is a more formal XML serialization of RDF), it was considered that the format produced would not be as easy to work with for validation tools.

With a controlled RDF/XML structure we were able to apply RNG (RELAX NG) to provide structural validation before accepting foreign data, and we were able to automate the export into more controlled formats using XSLT. (We used a rule engine for state validation.) Although RDF is a great way to interchange data against a changing model, XML is still better over the last mile to restrict the vocabulary of the data accepted.


Monday, January 12, 2009

Did Google Just Expose Semantic Data in Search Results?

Google has been hesitant, in the past, to employ semantic technology, citing trust issues and the lack of quality meta-data on the Web today. However, it would appear Google is warming to the idea of semantics in the area of natural language processing. I have written a couple of articles on the subject previously, but it appears that Google is exposing its own text analysis, according to this blog entry.

http://www.readwriteweb.com/archives/google_semantic_data.php


Thursday, January 8, 2009

Building Super-Scalable Web Systems with REST

I came across this blog posting that I thought was an interesting example of how the services should be oriented around the data and not the other way around.

http://www.udidahan.com/2008/12/29/building-super-scalable-web-systems-with-rest/

The idea here is that the data (in this case weather info) needs to be partitioned in a meaningful way (by location). REST services can then be created around this and utilize the caching available in HTTP.

By creating and organizing services around the domain's data model, more efficient services can be created. This reinforces why SOA often ends up failing: because there is not enough emphasis on the data. It also highlights REST's support for caching and cache validation built into the protocol. Other service/message specifications (like SOAP) would have more difficulty identifying and implementing a caching mechanism.
