Open Science

Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:

  • A scientist collects some data and stores it on a machine that is occasionally backed up by their department.

  • They then write or modify a few small programs (which also reside on their machine) to analyze that data.

  • Once they have some results, they write them up and submit their paper. They might include their data – a growing number of journals require this – but they probably don’t include their code.

  • Time passes.

  • The journal sends them reviews written anonymously by a handful of other people in their field. They revise their paper to satisfy the reviewers, during which time they might also modify the scripts they wrote earlier, and resubmit.

  • More time passes.

  • The paper is eventually published. It might include a link to an online copy of their data, but the paper itself will be behind a paywall: only people who have personal or institutional access will be able to read it.

For a growing number of scientists, though, the process looks like this:

  • The data that the scientist collects is stored in an open access repository like 4TU.ResearchData or Zenodo, possibly as soon as it’s collected, and given its own Digital Object Identifier (DOI).

  • The scientist creates a new repository on GitHub to hold their work.

  • As they do their analysis, they push changes to their scripts (and possibly some output files) to that repository. They also use the repository for their paper; that repository is then the hub for collaboration with their colleagues.

  • When they’re happy with the state of their paper, they post a version to arXiv or some other preprint server to invite feedback from peers.

  • Based on that feedback, they may post several revisions before finally submitting their paper to a journal.

  • The published paper includes links to their preprint and to their code and data repositories, which makes it much easier for other scientists to use their work as a starting point for their own research.

This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:

  • The conceptual stages of your work are documented, including who did what and when. Every step is stamped with an identifier (the commit ID) that is for most intents and purposes unique.

  • You can tie documentation of rationale, ideas, and other intellectual work directly to the changes that spring from them.

  • You can refer to what you used in your research to obtain your computational results in a way that is unique and recoverable.

  • With a version control system such as Git, the entire history of the repository is easy to archive in perpetuity.
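
One way to make a result "unique and recoverable" in practice is to save the commit ID of the analysis code alongside each output. The following is a minimal sketch, not a prescribed workflow: the function names, the field names in the record, and the DOI are all hypothetical, and it assumes the script runs inside a Git repository with Git installed.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit_id(repo_dir="."):
    """Return the full commit ID of HEAD in repo_dir, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, the directory is missing, or it is not a repository.
        return None
    return result.stdout.strip()


def provenance_record(commit_id, data_doi):
    """Bundle the identifiers needed to trace a result back to its code and data."""
    return {
        "code_commit": commit_id,
        "data_doi": data_doi,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # The DOI below is a placeholder, not a real dataset.
    record = provenance_record(current_commit_id(), "10.0000/placeholder")
    print(json.dumps(record, indent=2))
```

Writing a record like this to a small file next to each figure or table means that anyone holding the output can check out the exact version of the scripts that produced it. Note that the commit ID only describes committed work, so uncommitted changes in the working directory would not be captured.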

At TU Delft, researchers are able to store 1 TB/year of data for free on the 4TU.ResearchData data repository. You can organise your work into projects, and even connect your GitHub repository to link processing scripts to data stored in 4TU.ResearchData! For more information, contact ____.

Interested in learning more about working reproducibly with code and data? Join the Open Science Community Delft! ____ info.