Monday, March 9, 2009

RDF Federation

An RDF store (unlike an RDBMS) usually indexes everything. This means that the store's indexes are often larger than the data it contains. Such large indexes make I/O expensive and caching difficult. Scaling vertically by adding more memory for caching and faster disks can make a big difference, but it can get very expensive very quickly.

The alternative is to scale horizontally. This can be done in one of two ways: by mirroring the indexes on other machines, or by partitioning the indexes across other machines. The first option, called clustering, can reduce the I/O load, but caching remains difficult because every mirror still holds the full indexes. The second option, called federating, reduces the I/O load on each machine and can allow each machine to specialize, making caching much more effective.

Federating RDF stores is going to be a lot easier with Sesame 3.0, which will support federating multiple (distributed) Sesame Repositories into a unified store. This allows large indexes to be distributed across multiple machines that are connected over a network. The federation supports multiple ways of partitioning the data: by predicate (property), by subject, or both. When set up properly, the federation can effectively proxy queries to the specialized members and join queries among the distributed members. For large-scale RDF stores, federating is becoming a valuable solution for RDF architecture.
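To make the idea of partitioning by predicate concrete, here is a minimal sketch in Python. It is not the Sesame API: the `Member` and `PredicateFederation` classes, the `join_on_subject` helper, and the `ex:`/`foaf:` sample data are all hypothetical names invented for illustration, and each "member" is just an in-memory set standing in for a remote repository. It only shows how a pattern with a bound predicate can be proxied to one specialized member, while a pattern with an unbound predicate has to be sent to all of them.

```python
from collections import defaultdict


class Member:
    """Hypothetical stand-in for one federation member (a remote repository)."""

    def __init__(self, name):
        self.name = name
        self.triples = set()
        self.queries_served = 0  # counts how often this member is contacted

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position of the triple pattern.
        self.queries_served += 1
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


class PredicateFederation:
    """Hypothetical federation that partitions triples by predicate."""

    def __init__(self, routing, default):
        self.routing = routing  # predicate -> Member
        self.default = default  # member for predicates without a route

    def members(self):
        # Unique members, in a stable order.
        return list(dict.fromkeys(list(self.routing.values()) + [self.default]))

    def add(self, s, p, o):
        self.routing.get(p, self.default).add(s, p, o)

    def match(self, s=None, p=None, o=None):
        if p is not None:
            # Bound predicate: proxy the pattern to the one specialized member.
            return self.routing.get(p, self.default).match(s, p, o)
        # Unbound predicate: every member has to be asked.
        results = []
        for member in self.members():
            results.extend(member.match(s, p, o))
        return results

    def join_on_subject(self, p1, p2):
        """Join the patterns (?x p1 ?a) and (?x p2 ?b) on ?x inside the federation."""
        left = self.match(p=p1)
        by_subject = defaultdict(list)
        for s, _, o in self.match(p=p2):
            by_subject[s].append(o)
        return sorted((s, a, b) for s, _, a in left for b in by_subject.get(s, []))


# Sample setup: one member specializes in names, the other in links.
names = Member("names")
links = Member("links")
fed = PredicateFederation({"foaf:name": names, "foaf:knows": links}, default=links)
fed.add("ex:alice", "foaf:name", "Alice")
fed.add("ex:bob", "foaf:name", "Bob")
fed.add("ex:alice", "foaf:knows", "ex:bob")

# Each pattern in the join touches exactly one member.
rows = fed.join_on_subject("foaf:name", "foaf:knows")
print(rows)
```

In this sketch the join is evaluated by the federation after one request to each member. A real federation would need a query optimizer to decide where joins are evaluated and how much data crosses the network.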

Instructions for setting up a read-only Sesame Federation can be found here:
https://wiki.aduna-software.org/confluence/display/SESDOC/Federation

7 comments:

  1. Federation sounds like a big step towards structured data distribution on the Internet. When will Sesame 3.0 be officially released?

  2. Hi Daniela,

    Sesame 3.0 is suspended until the SPARQL 1.1 Working Group makes further progress. However, the federation has been backported to Sesame 2.3 and is currently shipping with AliBaba 2.0-alpha4.

  3. Thanks for the answer! Is it possible to use the RDF Federation Sail with Sesame 2.3 without using AliBaba?

  4. You don't have to use the object-mapper or server of AliBaba to use the federation. Just include the jar and follow the documentation.

  5. Dear James,

    federation through data/index partitioning sounds like the universal solution for all sorts of problems in DBMS. Still, the guys in the relational DBMS do use it too much. Any guess why? Any thoughts on its impact on the speed of query evaluation? Any ideas on how to get around the so-called "remote join" problem?

    Cheers,
    Atanas Kiryakov

  6. I meant "the guys in the relational DBMS do *not* use it too much" - sorry

    Atanas Kiryakov

  7. Federation in an RDBMS is not effective because of its data integrity constraints. An RDBMS relies on pessimistic concurrency control, which requires centralization. Federating multiple data sources requires optimistic concurrency control to scale effectively.

    Take a look at Eight Isolation Levels Every Web Developer Should Know for more information on concurrency controls.

    Partitioning your data effectively is a skill, just like designing an effective table schema. Both can have a significant impact on performance and scalability.

    A smart, well-informed query optimizer is essential for effective query processing. However, many distribution problems can be solved with minimal remote cross joins.
