Open research / June 8, 2020

Exploring re-use of datasets in institutional data repositories

As a general rule, you should cite your sources any time you use someone else’s words, methods, data or ideas in a piece of your own research. With data and code, re-use takes many forms: directly incorporating the data into your own raw data, running someone else’s analysis code on your own data, or even re-visualising the outputs. You can of course re-use your own data, in the same way you cite your own papers as proof of previous findings.

We know that a paper’s citation rate usually peaks 2 years after publication. With datasets or other research outputs that are published to support a paper, there is immediately 1 link to the dataset when the journal article is published. This leads to an awful lot of datasets having exactly 1 citation. In fact, until recently, every Dryad dataset had to be associated with a publication, so every dataset had at least 1 citation.

It should therefore be recognised that real re-use of data is only apparent when a dataset has more than 1 citation, ideally from papers written by authors other than the dataset’s own.
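The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not how Figshare or Dimensions compute anything: the `CitingPaper` class, the `reuse_summary` function and the author names are hypothetical, and matching authors by name string is an assumption (a persistent identifier such as an ORCID iD would be more reliable in practice).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitingPaper:
    """A paper that cites a dataset (hypothetical structure for illustration)."""
    doi: str
    authors: frozenset  # author names of the citing paper


def reuse_summary(dataset_authors, citing_papers):
    """Summarise citations to one dataset.

    'likely_reuse' is True only when the dataset has more than one citation
    and at least one citing paper shares no author with the dataset.
    """
    creators = frozenset(dataset_authors)
    # A citation is "independent" when the citing paper and the dataset
    # have no author in common.
    independent = [p for p in citing_papers if p.authors.isdisjoint(creators)]
    return {
        "citations": len(citing_papers),
        "independent_citations": len(independent),
        "likely_reuse": len(citing_papers) > 1 and len(independent) > 0,
    }


if __name__ == "__main__":
    # Made-up example: one self-citation plus one citation by another group.
    papers = [
        CitingPaper("10.0000/example.1", frozenset({"A. Author"})),
        CitingPaper("10.0000/example.2", frozenset({"B. Other"})),
    ]
    print(reuse_summary({"A. Author"}, papers))
```

A dataset whose only citation is the paper it was published to support would come out with `likely_reuse` set to `False`, which matches the point made above about datasets with exactly 1 citation.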

Figshare’s infrastructure uniquely tracks citation counts for all of the outputs in an institution’s repository. We do this by searching the full text of articles, not just the reference list, for citations to DOIs, through a partnership with our sister company, Dimensions. Whilst it is still too early to find conclusive evidence of what drives re-use of data, we can investigate a handful of examples published on our institutional clients’ infrastructure.
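To give a feel for what looking for DOIs in full text involves, here is a minimal sketch. It is not the Figshare or Dimensions implementation, which is not described in this post. The regular expression is a commonly used approximation of modern DOI syntax rather than an official grammar, and the function names and the sample sentence are made up for illustration.

```python
import re

# Approximate pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix running up to whitespace, quotes or angle brackets.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def find_dois(full_text):
    """Return the set of DOI-like strings found anywhere in the text."""
    found = set()
    for match in DOI_PATTERN.finditer(full_text):
        # Drop sentence punctuation that often trails a DOI in running text.
        doi = match.group(0).rstrip(".,;:)]}'")
        found.add(doi.lower())  # DOIs are case-insensitive
    return found


def cited_repository_outputs(full_text, repository_dois):
    """Return the repository DOIs that are mentioned in the article text."""
    wanted = {d.lower() for d in repository_dois}
    return find_dois(full_text) & wanted


if __name__ == "__main__":
    # Made-up sentence; the DOI is one of the examples discussed in this post.
    text = (
        "We test our method on a publicly available data set "
        "(https://doi.org/10.25378/janelia.6163622.v6)."
    )
    print(cited_repository_outputs(text, ["10.25378/janelia.6163622.v6"]))
```

Scanning the whole text rather than only the reference list matters because datasets are often mentioned in a methods section or a data availability statement without a formal reference. A real pipeline would also need to decide whether a citation to one version of a DOI (such as `.v6`) counts towards the unversioned record; this sketch does not handle that.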

| Institution | Output | Type | Citations from papers | Papers with distinct authors |
| --- | --- | --- | --- | --- |
| Stockholm University | https://doi.org/10.25378/janelia.6163622.v6 | Dataset | 6 | 3 |
| Loughborough University | https://doi.org/10.17028/rd.lboro.6176450.v1 | Dataset | 4 | 2 |
| Royal Holloway – University of London | https://doi.org/10.17637/rh.7000520.v4 | Code – Listed as Dataset | 3 | 3 |
| University of Adelaide | https://doi.org/10.25909/5becfa45c176f | Dataset | 3 | 3 |
| HHMI – Janelia | https://doi.org/10.25378/janelia.6163622.v6 | Dataset | 3 | 3 |

In the above examples, we see a good mix of citations by the same authors, suggesting re-use in the same way that papers are self-cited. However, every one of them has also been cited by another research group. We want to start peeling back the layers to understand why these research groups have re-used and cited the data or code.

Is compliance with the FAIR data principles a prerequisite for data re-use?

Four of the five randomly sampled examples above would seem to comply with the FAIR concept of being Findable, Accessible, Interoperable and Re-usable by both humans and machines. The one with the lightest metadata has only the following description: “The data comprises phenotypic measurements from two field trials in South Australia and genotyping information for more than 500 diverse wheat accessions.” There is no README file. And yet it has been re-used and cited by another research group.

Some datasets are better set up for re-use, with authors at the Janelia Research Campus even suggesting ways to reuse the data:

“Some potential projects to do with this data:

1) peer prediction: how well can you predict a neuron from the other 10,000? Can you beat our score?

2) face prediction: how well can you predict a neuron from the behavioral patterns on the face videos?

3) manifold discovery: can you find a nonlinear low-dimensional embedding? how low can it go?”

Interestingly, when we examine the citations, we see the following: “These findings agree well with those of the Stringer et al” and “We test our method on a publicly available calcium imaging data set.”

The amount of work needed to establish why people are citing and re-using non-traditional content on institutional repositories means that we are just scratching the surface in understanding what drives researchers to re-use data, let alone how we can encourage this behaviour. The fact that it is happening at all is a large societal change for a lot of academic research areas, and that alone is massively encouraging to see. As the numbers for data citation grow, we will continue to try to understand what encourages these steps. If you’d like to play with the citation data yourself, please get in touch at info@figshare.com.
