Guest post by Mayank Kejriwal.
Publications are the bread and butter of academics and researchers across disciplines, nationalities and institutions. Non-linear variants notwithstanding, the typical academic pursuit consists of devising a research question, doing the necessary background research, formulating and conducting experiments, and then writing it all up. The process is iterative and requires course corrections whenever something unexpected happens, a not uncommon occurrence in any reasonably ambitious research project.
The typical academic pursuit has survived through the ages and still holds to a great extent. But it is also incomplete. The rise of technology, and the increased standardization of best practices and protocols for conducting and recording research activities (including data collection and access), have made it possible for people not involved (whether wholly or in part) in the original research project to gain deeper insight into it beyond just reading the final paper. Even if we limit ourselves to publications, the rise of the pre-print draft, which is essentially an in-progress (or even preliminary) writeup of the research that gets published on a platform like arXiv and can be cited and downloaded, has led to a more organic and intimate view of the research project. But there is far more to this movement of ‘openness’ than just releasing or ‘publishing’ early drafts of papers on digital platforms.
In fact, a major thrust in the scientific community that has made data a first-class citizen, above and beyond the secondary artifacts that data generates (including publications), is a set of principles known as the FAIR data principles. FAIR data is Findable, Accessible, Interoperable and Reusable, and the FAIR principles were devised as guiding principles for scientific data management and stewardship, relevant to all stakeholders in the current digital ecosystem. But in my view, the FAIR principles are revolutionary not because they identify and lay out a blueprint for ‘how’ to fulfill a set of core requirements, but because of their emphasis on making data, in itself, the primary unit of discourse in good scientific stewardship. Of course, no scientist has ever disputed the value of data, which is the currency of empiricism. Without data, a theory is just that: a theory. The inductive underpinnings of science have always required data to support hypotheses, but for historical reasons or otherwise, the data itself has always taken a backseat to the more glamorous hypothesis statement, and to the publication that made that statement the center of scrutiny.
Not anymore. There is now an insistence that data be treated with the same kind of care previously reserved for the carefully written research publication. Data should not be hoarded away, but should be findable, especially by stakeholders not involved in the original research project. In particular, this means that data and supplementary materials should have sufficiently rich metadata and a unique, persistent identifier. Data should be accessible: metadata and data should be understandable to both humans and machines, and should be stored in a trusted repository. Data should be interoperable: metadata should use a formal, accessible, shared and broadly applicable language for knowledge representation, in the interests of cultivating and supporting standardization, as I mentioned before. It is important to understand why standardization matters: without it, scale cannot be achieved. Standardization yields a lingua franca that enables everyone to participate on a level playing field. Finally, and most importantly (in my admittedly biased view), data should be reusable. In terms of implementation, data collections should have a clear usage license and provide accurate information on provenance.
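To make the requirements above concrete, here is a minimal sketch of what a FAIR-ready metadata record might look like, expressed as a Python dictionary. The field names are illustrative (loosely inspired by common repository fields such as those in DataCite-style schemas), not a formal standard, and the identifier and URL are placeholders:

```python
# Illustrative FAIR-style metadata record; field names are assumptions,
# not a formal schema. Each group maps to one of the four FAIR principles.
fair_metadata = {
    # Findable: a unique, persistent identifier plus rich descriptive metadata
    "identifier": "10.xxxx/placeholder-doi",   # placeholder, not a real DOI
    "title": "Example survey dataset",
    "keywords": ["survey", "open data"],
    # Accessible: stored in a trusted repository, retrievable by humans and machines
    "repository": "Figshare",
    "access_url": "https://example.org/dataset",  # placeholder URL
    # Interoperable: standard, machine-readable formats and shared vocabularies
    "format": "text/csv",
    "metadata_standard": "DataCite",
    # Reusable: a clear usage license and accurate provenance
    "license": "CC-BY-4.0",
    "provenance": "Collected via online survey; cleaned and deduplicated",
}

def is_fair_ready(record):
    """Check that the minimum field for each FAIR principle is present and non-empty."""
    required = ["identifier", "access_url", "format", "license", "provenance"]
    return all(record.get(field) for field in required)
```

A simple check like `is_fair_ready(fair_metadata)` can then flag records that are missing, say, a license or provenance note before they are deposited.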
Platforms like Figshare make it possible to treat data and supplementary materials with the same care as an actual publication: most importantly, by making all such materials citable, and by offering the tools needed to meet the FAIR standards. This does not mean that Figshare or any other shared-repository platform (e.g., GitHub) will do the work for you (science was never meant to be easy!), but it will make it easy for you to do that work, and by making your materials citable, will also provide you with an adequate incentive. Figshare brings out the expressiveness of modern research, which is now recognized as being about so much more than the final publication itself, especially in data-hungry and computational fields like computer vision and AI, digital humanities and biology. Ultimately, Figshare provides a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner.
We’ve all heard the trope that ‘data is the new oil’, or that ‘data scientist is the sexiest job of the 21st century’. There’s an element of hype to these statements, because everyone who has worked with real data knows how difficult it is to reuse it and to ‘work around’ the quirks of raw data. It is well known in the data mining community that well over 80% of a data engineer’s time can be spent simply on cleaning and curating data. Given that, it is also undeniable that once the data is cleaned and curated, it has value for others. Thus, there is no reason why someone who has invested effort in providing that value should not be able to make that data citable and (with the appropriate permissions in place) receive due credit for creating it. With the rise of repositories, platforms and portals like Figshare, GitHub and arXiv, we are now in a position to expose all the rich intricacies of a research project, including data, software, drafts, images, presentations and even negative experimental results that never made it into print, rather than just a single monolithic publication.
References
https://libereurope.eu/wp-content/uploads/2017/12/LIBER-FAIR-Data.pdf
https://arxiv.org/
https://github.com/
Wilkinson, Mark D., et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data 3 (2016).