Normalising period data using RDF Data Utilities

authored by Frank Lynam on 23/01/2015 10:51:10

RDF Data Utilities is a website that I developed recently to help in the work of mapping messy data to structured RDF data. I found that I was carrying out certain tasks repeatedly in my ongoing work to make the data of the Priniatikos Pyrgos archaeological project available as RDF Linked Data on the Semantic Web. I developed a couple of utilities to automate these tasks and decided to make them available as web services at rdfdatautils.linkedarc.net. Currently the site hosts two utilities. The first handles the normalisation of number data and the second, with which this post is concerned, does the same for period data.

The normalisation of period data is one of those tasks that a lot of Digital History or Archaeology projects need to deal with sooner or later. Now, ideally any data that enters a database and that may need to be indexed at some future date should be subjected to certain constraints at its point of creation. Typically, this comes in the form of the database's input logic only allowing certain values to be entered into specific fields. The most obvious way of doing this is to use dropdown lists that ensure human users can only enter data that is part of a controlled list, aka a vocabulary or thesaurus.
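By way of illustration, input-time validation against a controlled list can be as simple as the following. A minimal sketch, using a hypothetical (and much abbreviated) period vocabulary:

# A hypothetical, abbreviated controlled vocabulary of periods.
ALLOWED_PERIODS = {"ebyzi", "ebyzii", "mbyz", "venetian", "ottoman"}

def validate_period(value):
    # Reject any value that is not in the controlled list.
    cleaned = value.strip().lower()
    if cleaned not in ALLOWED_PERIODS:
        raise ValueError("unknown period: %r" % value)
    return cleaned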

For a whole host of reasons, this does not always happen, and often a data manager finds him or herself having to deal with a range of values that record period data but do so in an unstructured fashion, typically having been entered originally as free text. This is exactly the situation that I encountered during my mapping work with the Priniatikos Pyrgos dataset, which originated as a FileMaker Pro database.

You could go through the dataset in question and use your own domain knowledge and good sense to change each of the entries to a set of values that are discrete and therefore more useful for the data’s eventual consumers. This process is known as data mapping and more generally as data normalisation. Usually, however, you are not afforded the luxury of going through the data by hand: the dataset is simply too large and correcting it manually would take too long, so you have to turn to an automated solution.

The value normaliser that I built and have made available at rdfdatautils.linkedarc.net aims to take the pain out of this common data-mapping task. It was written in Python, with the web.py library handling the serving of web pages.
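For readers unfamiliar with web.py, a service of this kind is wired up roughly as follows; a minimal sketch, not the site's actual code:

import web

# Map the root URL to a handler class, in the usual web.py style.
urls = ('/', 'Index')

class Index:
    def GET(self):
        return 'RDF Data Utilities'

if __name__ == '__main__':
    app = web.application(urls, globals())
    app.run()

Let us consider an example involving the mapping of a field containing period data.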

  1. The first thing to note is that rdfdatautils.linkedarc.net works with CSV files only, so you will need to export your data as such (a sketch of one way to do this export is given after this list).
  2. Here is an example of the data that we are going to input:

name, daterange
001-001,EBYZI to OTTOMAN
  3. The first line lists the field names and the second line contains the single row of data. This is a very simple dataset. It contains two fields, ‘name’ and ‘daterange’. The values for the first and only row are ‘001-001’ and ‘EBYZI to OTTOMAN’.
  4. Go to http://rdfdatautils.linkedarc.net/ and launch the <Value mapping> option.
  5. Choose the input CSV file and then click <Analyse input file> to continue.
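If your data lives somewhere else (the Priniatikos Pyrgos data, for instance, came out of FileMaker Pro), Python's csv module is one straightforward way to produce the expected input. A minimal sketch, using the hypothetical record below rather than any real export:

import csv

# A hypothetical record as it might come out of the source database.
rows = [{"name": "001-001", "daterange": "EBYZI to OTTOMAN"}]

with open("input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "daterange"])
    writer.writeheader()    # first line: the field names
    writer.writerows(rows)  # one line per record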

[Screenshot: Step 1]

  6. Now select the fields that you are going to map and how you are going to do this mapping. For our example, select <Custom CSV mapping file> and <Value range>. This means that you want to map all of the values in the daterange field, that you want to do this using a custom mapping file, and that the source data might contain a range of values rather than just a single value. Note that ranges are picked up in the input data when separators such as ‘to’, ‘and’ and hyphens are encountered (a sketch of this splitting heuristic follows the screenshot below). In the screenshot we have also specified that we want the normalised periods to be prefixed with ‘http://example.com/period/’, which effectively transforms each one into a URI, at a stroke making it much more Linked Data friendly.

[Screenshot: Step 2]
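The site does not publish its internals, but the behaviour just described (splitting on ‘to’, ‘and’ and hyphens) suggests something like the following. A speculative sketch, not the site's actual code:

import re

# Delimiters treated as range separators: the words "to" and "and",
# plus hyphens.
RANGE_DELIMITERS = re.compile(r"\s+(?:to|and)\s+|\s*-\s*", re.IGNORECASE)

def split_range(value):
    # Split a free-text value such as "EBYZI to OTTOMAN" into its
    # endpoint tokens, normalised to lowercase.
    return [part.strip().lower()
            for part in RANGE_DELIMITERS.split(value)
            if part.strip()]

print(split_range("EBYZI to OTTOMAN"))  # ['ebyzi', 'ottoman']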

  7. The custom mapping file is structured simply. It is itself a CSV file, containing a <from>, a <to> and an <index> field. An example mapping file is shown below; in fact, this is the mapping file used if you select the system mapping rather than supplying your own custom mapping file. The <from> value is an array of strings that the mapping engine searches your dataset’s chosen values against. If it matches one, it expands it out to whatever list of strings is found in the corresponding <to> field. Finally, the <index> value tells the engine about the order of the values. This matters for period data because, when you are dealing with a range of periods (as we are in our example dataset), the engine needs this ordering to work out which periods fall within the range and therefore how many to include in the output. We will see how this works in the sketch that follows the table below.

from | to | index
"prehistoric" | ["fn", "emi", "emii", "emiii", "mmi", "mmii", "mmiii", "lmi", "lmii", "lmiii"] | 0
["fn", "neo", "finalneolithic"] | "fn" | 0
"prepalatial" | ["emi", "emii", "emiii", "mmi"] | 1
"earlyprepalatial" | ["emi", "emii"] | 1
"emi" | "emi" | 1
"emii" | "emii" | 2
"lateprepalatial" | ["emiii", "mmi"] | 3
"emiii" | "emiii" | 3
["em", "earlyminoan"] | "em" | 1
["mm", "middleminoan"] | "mm" | 4
"protopalatial" | ["mmi", "mmii"] | 4
"mmi" | "mmi" | 4
"mmii" | "mmii" | 5
"neopalatial" | ["mmiii", "lmi"] | 6
"mmiii" | "mmiii" | 6
["lm", "lateminoan"] | "lm" | 7
"lmi" | "lmi" | 7
"finalpalatial" | ["lmii", "lmiii"] | 8
"lmii" | "lmii" | 8
"postpalatial" | "lmiii" | 9
"lmiii" | "lmiii" | 9
["ba", "bronzeage"] | ["emi", "emii", "emiii", "mmi", "mmii", "mmiii", "lmi", "lmii", "lmiii"] | 10
"historical" | ["eia", "archaic", "class", "hell", "eroman", "lroman", "ebyzi", "ebyzii", "mbyz", "venetian"] | 11
["eia", "geometric", "orient"] | "eia" | 11
"archaic" | "archaic" | 12
["class", "classical", "greek"] | "class" | 13
["hell", "hellenistic"] | "hell" | 14
"roman" | ["eroman", "lroman"] | 15
["eroman", "earlyroman"] | "eroman" | 15
["lroman", "lateroman"] | "lroman" | 16
["earlybyzantine", "ebyz", "earlybyz"] | "ebyzi" | 17
["ebyzi", "5", "6", "7"] | "ebyzi" | 17
["ebyzii", "8", "9"] | "ebyzii" | 18
["mbyz", "10"] | "mbyz" | 19
["byzantine", "byz"] | ["ebyzi", "ebyzii", "mbyz"] | 20
["venetian", "15"] | "venetian" | 21
"medieval" | ["ebyzi", "ebyzii", "mbyz", "venetian"] | 22
["ottoman", "19"] | "ottoman" | 23
["earlymodern", "modern", "20"] | "modern" | 24
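To make the role of the <index> column concrete, here is a speculative sketch of how a range such as ‘EBYZI to OTTOMAN’ might be expanded. The in-memory mapping and the function name are my assumptions, not the site's actual code, and group entries such as "byzantine" and "medieval" are left out for simplicity:

# Subset of the system mapping above: canonical period -> index.
# Group entries (e.g. "byzantine", "medieval") are omitted here.
PERIOD_INDEX = {
    "ebyzi": 17, "ebyzii": 18, "mbyz": 19,
    "venetian": 21, "ottoman": 23,
}

def expand_range(start, end):
    # Return every period whose index lies between the indices of the
    # two endpoints, ordered by index.
    lo, hi = PERIOD_INDEX[start], PERIOD_INDEX[end]
    return sorted((p for p, i in PERIOD_INDEX.items() if lo <= i <= hi),
                  key=PERIOD_INDEX.get)

print(expand_range("ebyzi", "ottoman"))
# ['ebyzi', 'ebyzii', 'mbyz', 'venetian', 'ottoman']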

  8. Click <Start mapping> to kick off the mapping process. After some time (this might take a minute or two depending on the size of your dataset) you will get the results.

[Screenshot: Results]

  9. The results are displayed as a table on the page and you can also download them as a CSV file via the link at the bottom of the page. You can see that the original field is still there, but it is now followed by a number of columns containing the normalised period data. Each of these columns holds a normalised period value representing a single period in the range specified. Note also that the values carry the URI prefix that we specified in the previous step (an illustrative output row is shown after this list).
  10. And that’s it. Your messy free-text period data is now normalised and a whole lot more useful for anyone who wants to start working with your dataset. I have used the example of period data, which is slightly more complex than most because it often involves period ranges rather than single period values, but there is nothing to stop you using rdfdatautils.linkedarc.net to normalise any free-text field that would be better organised as a controlled vocabulary.
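For our one-row example, the downloaded CSV would look something like the following. The exact column names and the set of intermediate periods included are inferred from the behaviour described above, not verbatim output:

name, daterange, daterange_1, daterange_2, daterange_3, daterange_4, daterange_5
001-001, EBYZI to OTTOMAN, http://example.com/period/ebyzi, http://example.com/period/ebyzii, http://example.com/period/mbyz, http://example.com/period/venetian, http://example.com/period/ottoman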
