8 - Version Control

Image representing versioning with numbers
Versioning allows you to keep track of changes to your data.

A version control system lets you keep track of changes to your data and processes.

Are you keeping track of any versions or logs made by the software in use?

Make sure you have a record of every step you have completed and, if possible, the version numbers of the program you are using and any libraries. Programs change over time, and a newer version can produce different results if someone tries to replicate your work after publication.

Never make alterations to your raw data files

Instead, make a copy of the raw data files and keep it in a dedicated folder somewhere safe, such as Research Drive or, for long-term storage, Research Vault. That way, if you need to redo your work or you find an error earlier in your workflow, you have an original baseline to start from.
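As a rough illustration of keeping an untouched baseline, the Python sketch below copies one file into a raw data folder, marks the copy read-only, and records a SHA-256 checksum so you can later confirm the file has not changed. The function names and the CHECKSUMS.txt manifest are assumptions made for this example, not part of any Griffith service.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_raw_file(source: Path, raw_dir: Path) -> str:
    """Copy source into raw_dir, make the copy read-only, return its checksum."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / source.name
    if target.exists():
        # Never overwrite an existing baseline copy.
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    shutil.copy2(source, target)
    # Remove write permission so the baseline is not edited by accident.
    target.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    checksum = sha256_of(target)
    # Hypothetical manifest file: one "checksum  filename" line per raw file.
    with (raw_dir / "CHECKSUMS.txt").open("a", encoding="utf-8") as manifest:
        manifest.write(f"{checksum}  {target.name}\n")
    return checksum
```

You could call `archive_raw_file(Path("survey.csv"), Path("RawData"))` once, straight after collecting the data, and then do all later work on separate copies.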

Write down versions of analysis software

Write down the versions of your analysis software (such as SPSS or NVivo) and hardware (such as MRI machines). Your documentation is a great place for this, but even your lab notebook will work.

Random Number Generator

If you are using random numbers in your research, save the seed of your random number generator as part of your working data. This way, you can reproduce your results later.
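A minimal Python sketch of this idea is shown below, assuming you draw a random sample with the standard library's `random` module. The function names and the JSON file layout are made up for illustration. Note that the same seed is only expected to give the same numbers with the same generator and software version, which is one more reason to record versions as well.

```python
import json
import random
from pathlib import Path
from typing import List


def draw_sample(seed: int, population_size: int, sample_size: int) -> List[int]:
    """Draw a sample of participant indices using an explicitly seeded generator."""
    rng = random.Random(seed)  # a private generator, independent of global state
    return rng.sample(range(population_size), sample_size)


def save_seed(seed: int, path: Path) -> None:
    """Store the seed next to the working data (hypothetical JSON layout)."""
    path.write_text(json.dumps({"random_seed": seed}, indent=2), encoding="utf-8")


def load_seed(path: Path) -> int:
    """Read the seed back so that the same sample can be drawn again."""
    return json.loads(path.read_text(encoding="utf-8"))["random_seed"]
```

Drawing the sample again with the seed returned by `load_seed` should give the same indices as the original run, provided the Python version has not changed.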

Why use version control?

It kind of sounds like a lot of effort, so why would you want to use version control? What are the benefits?

To avoid this!

Image representing what happens with bad version control
Uncontrollable versions can derail your workflow.

There are three key advantages of implementing version control practices or procedures that you should keep in mind:

  1. Infinite undos: If you control your versions, distinguishing active (live) copies from archived ones, you can always reverse a change or recover an earlier copy of a document.
  2. Branching and experimentation: If you can make a copy of documentation or code and clearly identify it as a copy, you can test changes and hypotheses without touching the original. For example, take a copy of commonly used code, add a descriptive prefix to the file name to mark it as a test version, and then try out methods to make the code run faster.
  3. Collaboration: Allowing multiple people to contribute to a document or code can greatly speed up the progress of research. Tools such as track changes in word processors, or GitHub for code, typically take care of highlighting changes and merging them when required.

First steps

Copy your raw data to a dedicated RawData folder in a cloud storage solution, such as Research Drive, for safekeeping.

Intermediate

If you are using a workflow program like Galaxy or KNIME, or a virtual lab like EcoCloud or the Australian Text Analytics Platform (ATAP), you can copy your workflow and save it as part of your documentation. If version numbers for the software are not available, write down the date that you ran the workflow.

Advanced

If you are writing scripts (R/Python/Matlab etc), use Git.

Note: Griffith has a GitLab instance you can use for private repositories. Also record the version of R/Python/MATLAB, the operating system you are using, and the version numbers of any libraries you are using.
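If your scripts are in Python, one way to capture this information is to have the script write it out each time it runs. The sketch below uses only the standard library. The function name and the example package names are assumptions for illustration, so replace them with the libraries your analysis actually imports.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return the Python version, operating system and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap rather than failing, so the log is still written.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Example package names only; list the libraries your script relies on.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Saving this output alongside your results, or committing it to your Git repository, keeps the version record together with the analysis it describes.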

If you are using the HPC, also record the version of any modules you used there.

SUPER Advanced

If you’ve heard of Docker or Singularity and you are interested in using them, talk to Griffith hacky hour/eResearch Services.

Internal Resources

External Resources