Jan 21 2012

RDF Nuts and Bolts or Get me a “LASER”

Published at 10:52 am under Uncategorized

I’m obsessing about a talk I’m giving at one of my favorite conferences, code4lib.  My talk proposal is about how we deal with whatever metadata comes our way.  For those of you not inside my head at this moment, “we” is where I work, part of which involves developing software for, and maintaining, a Digital Asset Management System (DAMS).  A digital asset is just a computer file, or set of files – often a picture, sound file, PDF, or video – that you have some desire to promote beyond just sitting on one person’s computer, unmanaged.  We all have computer files coming out of our ears, but we know some are more “valuable” than others, and we’d like to give them special treatment.  So we call them digital assets and get them moved into some sort of management system beyond the random file systems on our desktops.  This system is a DAMS.

I see a DAMS as a secure, reliable file system set up with good organization rules, and a goal of making the assets easy to find.  Here are some of the rules I like to see followed:

Good Organization

  • Describe the assets as thoroughly as possible and/or practicable.  This is your metadata.
  • Save this metadata and keep it associated with the assets.
  • Keep the metadata in a system that is flexible enough to accommodate attributes that you didn’t know about when you started – people change their minds about what they want all the time.  We chose RDF for this, and that is what my talk will be about.
  • Create unique identifiers for the assets.  We chose the ARK spec for this.

Reliability

  • Store the assets as simply as possible – don’t create a new file system because no one is going to understand it 30 years from now.  We chose the Pairtree spec for this (a minimal sketch of the idea follows this list).
  • Back your assets up – think like an IT person and have a data lifecycle plan.
  • Preserve your assets and metadata – think like a Librarian/Archivist and store the assets in at least three geographically separated places.  We use Chronopolis for this.
  • Export your metadata to the file system on a regular basis – this way the metadata becomes a digital asset/computer file too.
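
Since Pairtree does a lot of quiet work here, a minimal sketch of the core idea in Python – just the pair-splitting trick, leaving out the spec’s identifier cleaning and object directories:

    # A minimal sketch of the Pairtree idea, not the full spec: an
    # identifier becomes a predictable directory path by splitting it
    # into two-character pairs, so anyone with a file browser can find
    # an asset from its identifier 30 years from now.
    import os

    def pairtree_path(identifier, root="pairtree_root"):
        # Split the identifier into 2-character "shorties": "xt12t3" -> xt/12/t3
        pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
        return os.path.join(root, *pairs)

    print(pairtree_path("xt12t3"))  # pairtree_root/xt/12/t3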

Discoverability

  • Put all that lovely metadata into a full text search index.  We use solr for this (see the sketch after this list).
  • Tie all of the metadata to the unique identifier for the assets.
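
For the search side, a hedged sketch of stuffing metadata into solr via its standard JSON update endpoint – the core name (“dams”) and the field names are made up for illustration, and it assumes a solr instance running locally:

    # Post one document to a (hypothetical) "dams" solr core.
    import json
    import urllib.request

    doc = {
        "id": "ark:/99999/fk4example",   # the ARK ties the index entry to the asset
        "title_t": "An example asset",
        "subject_facet": ["Photographs"],
    }

    req = urllib.request.Request(
        "http://localhost:8983/solr/dams/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)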

 

As I noted above, I’m only talking about RDF and metadata at code4lib.  What I’m obsessing about is that the talk is only 20 minutes.  I usually talk about our DAMS in about an hour, and I’m only getting warmed up in the first 20 minutes.  So I’ve got to empty my head of all this other DAMS stuff and laser down on just the RDF and metadata part.

We didn’t choose RDF because of the newish Linked Open Data (LOD) movement.  Our (now retired) architect, Chris Frymann, was aware of the possibility, but this was nearly ten years ago and LOD was barely a twinkle on the horizon.  Before this job I had been working in industry for years, so at first this approach looked silly and academic to me.  Once Chris had me drink the RDF Kool-Aid, we envisioned a system that embraced flexibility from the start.

RDF is so simple, and yet so terrifyingly different from the fixed-database world I was used to.  Instead of a well-defined table, or tables, we had millions of triples.  We didn’t even have a triple store, just three columns in an SQL database.
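
To make that concrete, here is a sketch of what that can look like – the table and column names are my illustrative guesses, not our actual schema:

    # "Just three columns": subject, predicate, object.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?)",
        ("ark:/99999/fk4example", "mods:title", "An example asset"),
    )
    # New kinds of metadata are just new predicate values; no ALTER TABLE ever.
    for row in conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ?",
        ("ark:/99999/fk4example",),
    ):
        print(row)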

What is wonderful about this approach is that each triple is somewhat self-documenting.  A triple is made up of a Subject, a Predicate, and an Object:

[Image: RDF Triple diagram – Subject, Predicate, Object]

We use our asset’s unique identifier, the ARK, as our Subject.  Next we needed to describe the assets with their metadata – so we started creating Predicates that could hold types of metadata.  Three years later… no, really.  This was probably one of the hardest things to do, and I’m not sure we’ll ever stop doing it.  Some of our original assets had MARC records, and there were ways to convert MARC to RDF.  Lots of deep discussions among metadata librarians, asset-owner librarians, and the tech folks came to the conclusion that we wanted to cast our metadata into specific namespaces, namely MODS, PREMIS, and MIX.  This was way beyond the Dublin Core defaults that other products were using, but we knew RDF was flexible enough to accommodate just about anything, so we just did it.
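
If you haven’t seen triples in the wild, here is a sketch (using Python’s rdflib) of the shape of the thing.  The namespace URIs and predicate names are illustrative guesses, not our production vocabulary:

    # An ARK as the Subject, with namespaced Predicates describing it.
    from rdflib import Graph, Literal, Namespace, URIRef

    MODS = Namespace("http://www.loc.gov/mods/v3#")          # assumed URI
    PREMIS = Namespace("http://www.loc.gov/premis/rdf/v1#")  # assumed URI

    asset = URIRef("ark:/99999/fk4example")  # the ARK is the Subject
    g = Graph()
    g.add((asset, MODS.title, Literal("An example asset")))
    g.add((asset, PREMIS.hasSize, Literal(12345)))
    print(g.serialize(format="nt"))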

Guided by the head of our Metadata Analysis and Specification Unit (MASU – lots of great detail at that link), Brad Westbrook, we started spec’ing out what the metadata needs were for each asset.  Ok, that’s a lie… We did it per “collection,” which was how we actually received assets from the librarians.  Our DAMS works at the asset level, but our librarians normally think at the collection level.  This was just another layer of translation, and the MASU group stepped up to play a liaison role in getting the assets ingested.  Over time, this became a workflow where:

  1. Collections are identified/approved for ingestion into the DAMS.  (The project management and institutional buy in on this process is another talk!)
  2. MASU creates an Assembly Plan that maps the assets into collections and specifies what namespaces the metadata pieces are placed in, then hands it off to IT Development.
  3. IT Development creates mapping scripts from what the Assembly Plan calls for into RDF.  This is done in XSL (see the sketch after this list).
  4. IT Development ingests the assets into the storage system and parses the metadata into RDF in the triple store.
  5. Profit.
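
For a feel of step 3, a sketch of running such a mapping from Python with lxml – the file names are placeholders, and this assumes the Assembly Plan mapping has already been written as an XSL stylesheet:

    # Apply a (hypothetical) XSL mapping to source metadata to get RDF/XML.
    from lxml import etree

    transform = etree.XSLT(etree.parse("assembly_plan_mapping.xsl"))  # placeholder
    source = etree.parse("collection_metadata.xml")                   # placeholder
    rdf_xml = transform(source)
    print(etree.tostring(rdf_xml, pretty_print=True).decode())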

Ok, someone tell me how to get all that into 20 minutes… 😉  The Assembly Plan alone is an intense spreadsheet and text document that explains what is needed.  Then the translation scripts are another challenge to present without everyone going cross-eyed.  Not to mention this thing of beauty!

[Image: DAMS RDF Graph]

That’s all of the metadata and relationships of one asset.  Maybe I’ll just put that on the screen and take questions for 20 minutes… 😉


5 Responses to “RDF Nuts and Bolts or Get me a “LASER””

  1. Richard on 21 Jan 2012 at 12:24 pm

    D,
    What does the DAMS do with the RDF besides store it and the assets? Is it used to build indexes, export packages, etc.? Do those applications just use a sub-graph from the thing of beauty?

    I’m guessing from the genID nodes here that there are lots of blank nodes. The linked data approaches frown on this because those IDs can change. Have they caused you any particular trouble here? (and how do you work around them if exposing LOD?)

  2. declan on 21 Jan 2012 at 2:59 pm

    Thanks for the questions, Richard!

    The solr index is stuffed with the metadata. We style the RDF into JSON, then ingest it into solr for the text search and some faceting. We do cherry pick what metadata we expose this way – some of the internal triples aren’t very interesting beyond our access control mechanisms.

    Yep, those are blank nodes, and they allow us to be more specific about certain kinds of predicates. MODS has depth that we don’t want to lose. The blank nodes also allow us to maintain the order of things if needed. On their own, triples have no sequence; they all stand alone. We’re using blank nodes to label what’s first, second, etc. (roughly like the sketch below).
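
    A hedged rdflib sketch of that ordering trick – the namespace and predicate names here are made up for illustration:

        # One blank node per entry, with an explicit sequence number on it.
        from rdflib import BNode, Graph, Literal, Namespace, URIRef

        EX = Namespace("http://example.org/dams#")   # made-up namespace
        asset = URIRef("ark:/99999/fk4example")

        g = Graph()
        for position, label in enumerate(["Page 1", "Page 2"], start=1):
            part = BNode()                                 # a blank node per part
            g.add((asset, EX.hasPart, part))
            g.add((part, EX.sequence, Literal(position)))  # the order lives here
            g.add((part, EX.title, Literal(label)))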

    We’re not sure how to practically play in the LOD world. Our current work is looking at schema.org. This would require some flattening of the blank nodes, but that can all be rule based and automated. Such fun!

  3. Ed Summers on 24 Jan 2012 at 11:58 am

    Awesome post Declan. It would be really cool to hear about the details of your system, and how RDF is used; but I think you are right, it would take longer than 20 minutes. Perhaps if you whet people’s appetite you could have a breakout session that goes into more of the details of the DAMS. That is, if you aren’t too hung over.

    dchud has a nice way of organizing his talks by “leading with the punchline”. I think I am remembering that right; but maybe he didn’t make it up, and got the advice from somewhere else. So I’m passing that along. You might want to think about the essential point you want to communicate, kick off with that, and have everything else in the talk organized to support that message.

    So is your message “use RDF because you can say whatever the hell you want and you don’t need to change your schema”? If so, would you roll w/ a NoSQL solution if you were doing things again today? Why or why not?

    Or is your message “RDF is a really flexible way to describe stuff and we’ve used it in these ways, but we’d like for our community to describe things similarly so we can work together more?” In that case it might be useful to take stock of the predicates you cooked up, to see what all is out there that could be used now, and where there are gaps. The DCMI has been perennially interested in these things called Application Profiles, which could allow the digital preservation community to define a grab bag of vocabularies that are useful for description. I wonder if that could be a useful thing to try to bubble upwards. Or maybe it’s tilting at windmills…

    Most likely your message is something else, and I am putting words in your mouth.

    I’m also left wondering how much of the data in your DAMS is out on the Web for people to use, either in your organization, community, or out on the public web. Having the data out there for others to use is a really important part of the equation, I think.

    So, a few thoughts. Seeya in Seattle.

    [emoticons cleaned out by edsu’s request]

  4. Trish Rose-Sandler on 08 Feb 2012 at 11:38 am

    Declan,

    I didn’t catch the talk (so not sure how well the 20 min limit worked out for ya) but I think you did a good job in your slides of conveying the UCSD Libraries’ process of moving digital collections from submission to ingestion to accession. As someone who was part of the metadata analysis and spec unit, it was interesting to see this process all documented. Of course I had a sense of how it worked once it left MASU’s hands, but this presentation gives the full picture. I think you were smart in not giving the nitty-gritty details of the object specs – probably not of interest to the audience, and they just need to know what purpose it serves.

    I didn’t realize the rdf triples were initially stored as 3 columns in a db. Is that still the case or are they now stored as triples? I do think Chris Frymann was ahead of his time in pushing RDF. Hopefully it will serve you well as you move into the waters of LOD. Good luck!

  5. declan on 12 Feb 2012 at 3:39 pm

    Hey Trish! The talk went well! We store them both ways, as columns in SQL and in a real triple store called AllegroGraph. Chris was certainly ahead!

    Hope you are doing well in STL!

    D
