I am designing a piece of software that will make extensive use of ontologies. For my purpose, an ontology is a list of “terms”, each with a name and description, where the terms can be related to each other in a number of ways and the relations form a directed acyclic graph (DAG).
BTW, I suggest that before you go on, you read this article. The author is trying to do something very similar to what I’m doing, and his article is excellent, with code examples and diagrams and everything. I have devised a solution that differs in some ways.
Example
See the article.
Problem
The problem is that we often want to be able to quickly pull out all terms that are related, directly or indirectly, to another term. This requires quickly finding all terms that have a path to a particular term.
Furthermore, we want the database design to be DBMS-independent, so it cannot use special features particular to certain DBMSes. In particular, I’m trying to avoid stored procedures.
Inspiration
Not wanting to re-invent the wheel, I investigated various database packages that use ontologies, including GO and BioSQL
I also did some reading and came across this very helpful article (already mentioned multiple times)
Solution
I decided that each ontology would be a completely self-contained unit. Hence, a term in one ontology could not be related to a term in another ontology.
First of all, the schema:
Schema
Four tables: ontology, term, term_rel, and term_path.
- ontology: contains a name and description of the ontology
- term: contains a name and description of the term, a foreign key to the ontology, and flags indicating that the term is the root term or is obsolete
- term_rel: defines direct relationships between terms (edges on the DAG):
- id: unique id
- subj_term_id: foreign key to the subject term (a.k.a. the descendant, on the non-arrow side of the edge)
- obj_term_id: foreign key to the object term (a.k.a. the ancestor, on the arrow side of the edge)
- pred_term_id: a term from the “predicate” ontology indicating how the terms are related (e.g. “is a”, “has a”)
- ontology_id: redundant foreign key to the ontology to which subj and obj belong; makes queries easier
- term_path: defines all paths
- (contains all the same fields as term_rel), plus
- term_path_id: nullable reference to another term_path (explained below)
- term_rel_id: reference to a term_rel (explained below)
- distance: number of edges on the DAG in this path
Logic
Term path has to be maintained when term_rels are added and deleted. This logic was inspired by the aforementioned very helpful article, though I have made significant changes (that make it simpler and less DB-dependant, though it does require outside logic).
Each term_path is identified by an integer primary key. However, it is also defined by another term_path (call it the subpath) and a term_rel. The subpath is that path which contains all edges of the main path, except for the last one. So, if the path is A -> B -> C -> D, then the subpath is A -> B -> C. The term_rel is the last edge, in this case, C -> D. Paths that contain only a single edge will have no subpath (term_path_id is null). All paths have a term_rel.
When adding a new relationship from A -> B, do the following:
- Add a path from A -> B with distance 1. It will have no term_path_id, and term_rel_id will reference the term_rel just added.
- For every existing path, P, to A (with subj_term_id = A’s id), add a new path P* from P’s subject to B.
- Take all paths added in #1 and #2 and put them in a stack
- If the stack is empty, then you are done.
- Pop the top path (Pt) off the stack
- Find all edges (term_rels) that can extend Pt. These would have term_rel.subj_term_id = term_path (Pt).obj_term_id. If there are none, go back to step 4.
- For each of those edges, create a new path, extending Pt, just as you did for P in step #2, and push the new edge onto the stack.
- Go to step 4.
That’s additions. Now for deletions. Deletions are trivially easy if your DBMS supports cascading deletes. Simply set the term_path.term_path_id and term_path.term_rel_id foreign keys to cascade deletes. Now, when a term_rel is deleted, all paths that include it wil be as well.
If your DBMS does not suppor this, then you have two options:
- Use a database abstraction layer that will emulate it.
- Implement the logic yourself
I leave #2 as an exercise to the reader
Implementation
This actually works, by the way. I’m using it.
I am doing this in a Symfony project, using the Propel for ORM. So, I am able to use cascading deletes, as Propel will do the work for me if the underlying DBMS will not.
The code is available upon request. I will be happy to provide it for you. Since it’s all done within the Propel framework, it may be hard to follow for those not familiar with Propel. Your best bet is to hit me up via my work web page.