Currently, the Jena API provides persistent storage of RDF data using relational database models such as MySQL, HSQLDB, PostgreSQL, Oracle and Microsoft SQL Server. These relational databases are convenient for storage but suboptimal for the storage and access model present in the Semantic Web. The focus of this project is to extend the Jena API with the ability to persist graphs/models using a non-Relational Database Model.
High Level Design
SolrStore adds to Jena the capability to persist RDF/XML graphs by creating the data store directly within a Lucene inverted index structure. Put simply, our approach does this without additional software components and without an additional RDBMS. While Jena's existing ability to persist RDF/XML graphs to an RDBMS is a convenient storage choice, we argue that an RDBMS is not really appropriate for the Semantic Web, as transaction processing and normalized schemas do not fit the dynamic nature of the Semantic Web domain.
Building upon this argument, the dynamic nature of the Semantic Web is better suited to use versioning in lieu of heavy duty transaction processing. It is our intent to exploit and leverage this inherent capability within Lucene and ultimately present it to Jena developers in an abstract way within the Jena API. Additionally, we believe that the Lucene/Solr indexing engine is underutilized in that it serves primarily as an index with pointers back to the original data source. We intend to not only use Lucene/Solr as an indexing engine, but also as the repository for the data source.
The prime directive for our project is to provide layers of abstraction between the Jena API and the Lucene/Solr APIs. An interface will be created that allows RDF data to be stored and retrieved in a Lucene inverted index structure. This interface will be available through the Jena API. This commitment is extremely important to us, as the complexities of our interface should not overburden a Jena developer who might have limited experience with the components within Lucene and Solr.
We acknowledge that the use of an RDBMS to persist RDF/XML graphs within the Jena API was an innovative design choice at the time of its creation. Our team's objective is to evolve that innovation by building upon it with new technologies that are now available and accessible.
While this initiative is not a trivial task, we believe that the objective is important and if successful can benefit the Jena community.
Detail Level Design
The Resource Description Framework (RDF) is a standard for describing resources on the web, commonly serialized as XML. It lets you represent information in the form of a graph, which is a set of individual objects and associations between those objects. RDF is one of the key elements of the Semantic Web. Here is an example of an RDF statement in plain text:
| [resource] | [property] | [value] |
| LuceneStore | Uses | NoSQL |
| [subject] | [predicate] | [object] |
The underlying structure of any RDF expression is a collection of triples, each consisting of a subject, a predicate and an object, which correspond to a resource (subject), a property (predicate), and a property value (object). A set of such triples is called an RDF graph. This can be illustrated by a node and directed-arc diagram, in which each triple is represented as a node-arc-node link (hence the term "graph").
Each triple represents a statement of a relationship between the things denoted by the nodes that it links. Each triple consists of:
a subject,
an object, and
a predicate (also called a property) that denotes an association.
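To make the triple structure concrete, here is a minimal plain-Java sketch that models a graph as a set of triples. The names `Triple` and `TripleGraphSketch` are hypothetical and are not part of the Jena API, and the second statement in the example graph is invented purely for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical type for illustration; Jena has its own Triple/Statement classes.
record Triple(String subject, String predicate, String object) {}

class TripleGraphSketch {
    // An RDF graph is modeled here as a set of triples, so repeating a
    // statement does not add a second copy.
    static Set<Triple> exampleGraph() {
        Set<Triple> graph = new LinkedHashSet<>();
        graph.add(new Triple("LuceneStore", "Uses", "NoSQL"));
        graph.add(new Triple("LuceneStore", "Uses", "NoSQL"));   // duplicate, ignored
        graph.add(new Triple("LuceneStore", "Extends", "Jena")); // invented example statement
        return graph;
    }

    public static void main(String[] args) {
        // Print each triple as a node-arc-node link.
        for (Triple t : exampleGraph()) {
            System.out.println(t.subject() + " --" + t.predicate() + "--> " + t.object());
        }
    }
}
```

Using a set rather than a list mirrors the definition above: a graph is a set of triples, so a statement is either in the graph or not.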
The Jena framework offers various representation modes for RDF triples. Besides memory and file storage, Jena comes with two systems designed to persist RDF and OWL data: TDB and SDB.
SDB provides scalable storage and query of RDF datasets using conventional SQL databases. SDB supports Microsoft SQL Server, Oracle, PostgreSQL, MySQL, HSQLDB, and Apache Derby. It is specifically designed to support SPARQL.
TDB is a high-performance, non-transactional persistence engine using custom indexing and storage. TDB's design goal is to provide the storage layer both for single-machine use and for distributed clusters of industry-standard servers, as found in enterprise data centers.
SDB could be a better choice if you need transactions and have a lot of updates from remote machines, whereas TDB is the better choice when scalability and clustering matter most.
RDF Persistence with LUCENE
Each triple within the graph will be stored as a document within Lucene. Each document will contain the subject (s), predicate (p), and object (o). Lucene, by its nature, is a good place to store semi-structured data. The basic unit of storage in Lucene is the Document. A document consists of a set of fields, each of which can be indexed, stored, or both.
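The one-document-per-triple idea can be sketched in plain Java, without any Lucene dependency. This is only an illustration of the concept under simplifying assumptions (exact-match terms, no analysis, in-memory storage); it is not the Lucene API and not the SolrStore implementation, and the names `TripleIndexSketch`, `TripleDoc`, `add` and `find` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java illustration only; this does not use or mimic the Lucene API.
class TripleIndexSketch {
    // One "document" per triple, with fields s, p and o.
    record TripleDoc(String s, String p, String o) {}

    // Stored fields: the document itself, addressed by its position (doc id).
    private final List<TripleDoc> stored = new ArrayList<>();
    // Inverted index: "field:value" term -> ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Store a triple as a document and index all three fields.
    int add(String s, String p, String o) {
        int docId = stored.size();
        stored.add(new TripleDoc(s, p, o));
        index("s", s, docId);
        index("p", p, docId);
        index("o", o, docId);
        return docId;
    }

    private void index(String field, String value, int docId) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docId);
    }

    // Return all stored triples whose given field exactly matches the value.
    List<TripleDoc> find(String field, String value) {
        List<TripleDoc> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(field + ":" + value, Set.of())) {
            hits.add(stored.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        TripleIndexSketch idx = new TripleIndexSketch();
        idx.add("LuceneStore", "Uses", "NoSQL");
        for (TripleDoc d : idx.find("p", "Uses")) {
            System.out.println(d.s() + " " + d.p() + " " + d.o());
        }
    }
}
```

Because every field is both indexed and stored in this sketch, the index doubles as the data repository: a lookup returns the triple itself rather than a pointer back to another data source.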
Index Assembler Specification
To stay consistent with Jena's conventions, we use an assembler specification file that lets the user provide the index directory location. An Assembler specification is a Resource in some RDF Model. The properties of that Resource describe what kind of object is to be assembled and what its components are: for example, a LuceneIndexDir is constructed by specifying a base model. The specifications for the components are themselves Assembler specifications given by other Resources in the same Model. For example, to specify a memory model with data loaded from a file:
The rdf:type of eg:LuceneIndexDir specifies that the constructed Model is to be a Jena memory-based model. The ja:indexlocation property specifies where the indexes will be stored on the file system.