Connecting Infrastructure, Connecting Research

Development of a chemical properties database

Name: Keiron Taylor
Institution: University of Southampton
Research: Development of a chemical properties database

The volume of data produced by chemists since the invention of the periodic table in 1869 has grown so large that new methods are needed to deal with it. The NGS has been working with researchers at the University of Southampton to develop ways of handling data on this scale. Chemical data now needs annotations and metadata attached to it to be meaningful to the user. It is no longer enough simply to record the boiling point of a liquid. Researchers wanting to use that information also need other details, such as:

  • the pressure the measurement was taken at
  • who did the experiment
  • when the experiment was performed
  • where the experiment was performed

The provenance of a piece of data is of extreme importance to the end user. They need to be able to trace back from publication to the original data.
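As a rough illustration of the annotations listed above, a single boiling point measurement and its provenance could be written as a set of subject, predicate, object statements. This is a minimal sketch in plain Python, not the Southampton data model: the `ex:` identifiers, the helper functions `describe` and `trace_to_data`, and all of the values are hypothetical.

```python
# Hypothetical example: every identifier and value below is illustrative only.
measurement = "ex:measurement1"

triples = {
    (measurement, "ex:ofSubstance", "ex:water"),
    (measurement, "ex:boilingPointCelsius", 100.0),
    (measurement, "ex:pressureKilopascal", 101.325),   # pressure of the measurement
    (measurement, "ex:performedBy", "ex:researcherA"),  # who
    (measurement, "ex:performedOn", "2008-01-15"),      # when
    (measurement, "ex:performedAt", "ex:labB"),         # where
    ("ex:publication1", "ex:reports", measurement),     # link from paper to data
}


def describe(subject, store):
    """Return every annotation attached to a subject as a predicate -> object dict."""
    return {p: o for (s, p, o) in store if s == subject}


def trace_to_data(publication, store):
    """Follow ex:reports links from a publication back to the original measurements."""
    return sorted(o for (s, p, o) in store if s == publication and p == "ex:reports")


for m in trace_to_data("ex:publication1", triples):
    print(m, describe(m, triples))
```

Adding a new kind of annotation here means adding one more statement, with no change to the existing ones.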

The need to store this information and make it easily available to researchers makes chemical databases harder to maintain. With computational and combinatorial chemistry increasing the rate at which chemical data is produced, Keiron Taylor and colleagues are taking a ‘semantic web’ approach.

Researchers at the University of Southampton have created a Resource Description Framework (RDF) triplestore for chemical data. With the help of expertise from the NGS, they are now investigating whether the Oracle 10.2 and 11g databases can improve query speed. They are also working together to demonstrate that multiple, distributed RDF triplestores can be queried together.

The Oracle RDF triplestore hosted on the NGS is being used alongside the RDF triplestore already developed at the University of Southampton. The aim is to combine data from both triplestores dynamically, without copying entire databases across. Traditional relational databases are not flexible enough to deal with the data and its continuously changing annotations; RDF is. An RDF triplestore also allows more complex queries to be run on the data. Its power lies in its use of triples, each of which represents a relationship between a subject and an object.
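The idea of querying two triplestores together without copying either one can be sketched with two small in-memory lists standing in for the real stores. This is only a toy model under stated assumptions: the functions `match` and `federated_match`, the `ex:` identifiers and the values are hypothetical, and a real deployment would use the query interface of each triplestore rather than Python lists.

```python
from itertools import chain

# Two independent stores standing in for the Southampton and NGS triplestores.
# All identifiers and values are hypothetical.
store_a = [
    ("ex:ethanol", "ex:boilingPointCelsius", 78.4),
    ("ex:ethanol", "ex:formula", "C2H6O"),
]
store_b = [
    ("ex:ethanol", "ex:measuredBy", "ex:researcherA"),
    ("ex:water", "ex:boilingPointCelsius", 100.0),
]


def match(store, s=None, p=None, o=None):
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for triple in store:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def federated_match(stores, s=None, p=None, o=None):
    """Run one pattern against each store in turn and chain the results.

    The stores themselves are never merged or copied; only matching
    triples are returned to the caller.
    """
    return chain.from_iterable(match(store, s, p, o) for store in stores)


# Everything either store knows about ethanol.
ethanol_facts = list(federated_match([store_a, store_b], s="ex:ethanol"))
print(ethanol_facts)
```

Because both stores use the same identifier for the same substance, their results line up without any shared schema being agreed in advance.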

Relationships between subjects and objects can be explicitly defined or implied by other relationships. Using an RDF triplestore avoids the need for a traditional relational database, which often requires long-term design and maintenance that is not an option in the academic world. RDF triplestores have another major advantage over relational databases for this research: even though different triplestores may store different information about the same object of interest, they can still share that information.
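A relationship that is implied rather than stated can be derived by applying simple rules to the stored triples. The sketch below applies two standard RDFS rules (class membership is inherited through `rdfs:subClassOf`, and `rdfs:subClassOf` is transitive) until nothing new appears. The function name `infer_types` and the `ex:` class names are hypothetical, and this is not a description of how the Oracle or Southampton stores perform inference.

```python
def infer_types(triples):
    """Forward-chain two RDFS rules until no new triples are produced."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(s, o) for (s, p, o) in inferred if p == "rdfs:subClassOf"}
        types = {(s, o) for (s, p, o) in inferred if p == "rdf:type"}
        new = set()
        # Rule 1: x is a C, and C is a subclass of D  =>  x is a D.
        for (x, c) in types:
            for (c2, d) in subclass:
                if c == c2:
                    new.add((x, "rdf:type", d))
        # Rule 2: A subclass of B, and B subclass of C  =>  A subclass of C.
        for (a, b) in subclass:
            for (b2, c) in subclass:
                if b == b2:
                    new.add((a, "rdfs:subClassOf", c))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical explicit statements.
explicit = {
    ("ex:ethanol", "rdf:type", "ex:Alcohol"),
    ("ex:Alcohol", "rdfs:subClassOf", "ex:OrganicCompound"),
    ("ex:OrganicCompound", "rdfs:subClassOf", "ex:Compound"),
}

closure = infer_types(explicit)
for triple in sorted(closure - explicit):
    print(triple)  # statements that were implied, not stored
```

A query for every `ex:Compound` would then find ethanol even though no one recorded that fact directly.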

Oracle's RDF data management capabilities, including high query performance and triple loading, mean that storing databases of 100 million triples is not a problem. Tools such as Oracle Jena are used for triple loading and storage. In addition, the Oracle 11g semantic database provides features such as ontology-assisted querying of relational data.

A tutorial on running Oracle on the NGS is available.
