1 - Data quality

Image: Arrange your data using tidy data principles

“Tidy data” principles

As your research progresses, new possibilities for enquiry might be revealed, and you may want to use a new method to analyse your data. Depending on how you recorded and stored your data, reconfiguring your datasets for the new analysis can take a great deal of time-consuming work. While all the necessary data may exist within your datasets, there may be no simple and reliable way to extract it.

The principles of tidy data developed by Hadley Wickham offer a way of recording your data, be it in spreadsheets or database tables, so that every element within those datasets can easily be accessed for new analysis and modelling. Even if a dataset needs to be reconfigured before being imported into the new software, it will be a simple process to produce the necessary file.

The fundamental principles of tidy data are:

  • Each variable is a column
  • Each observation is a row
  • Each cell contains a single value

By arranging all the variables of an observation along a single row, all the information about that observation is contained within a clearly delineated record: the single row. With all variables recorded in columns, these observations of a patient, a survey respondent, a historical event, a biological sample, etc. are reliably linked to their subject.
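
As a hedged sketch of what this looks like in practice (the patient IDs, years and measurement values below are invented for illustration), a "wide" spreadsheet-style table can be reshaped with Python and pandas so that each row holds a single observation:

```python
import pandas as pd

# An untidy ("wide") layout: the year variable is spread across the column headers.
untidy = pd.DataFrame({
    "patient_id": ["P01", "P02"],
    "2022": [7.1, 6.4],
    "2023": [7.4, 6.0],
})

# Reshape so that each variable (patient_id, year, measurement) is a column
# and each row is a single observation: one patient in one year.
tidy = untidy.melt(id_vars="patient_id", var_name="year", value_name="measurement")
print(tidy)  # one row per (patient_id, year) observation
```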

Tidy data provides both the stability and the flexibility to reliably process and export data to any system, including for analyses and modelling that only become desirable or significant later in your research.

In a spreadsheet, this format must not vary, neither to accommodate additional data for an individual observation nor to create an aesthetically pleasing layout (e.g. merged cells).

Since each cell must contain a value, it is important that each cell contains only one. This isolates and labels the information about a variable and simplifies access to it when required. For example, when recording the addresses of respondents, placing the entire address in a single cell or line makes potentially usable data difficult to access. Dividing these details across several columns creates the potential to model additional data based on location.
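
As an illustrative, non-prescriptive sketch (the address layout and column names are assumptions), a single address column can be split into separate columns so the location details become individually accessible:

```python
import pandas as pd

# The whole address sits in one cell, so the location details are hard to use.
responses = pd.DataFrame({
    "respondent": ["R1", "R2"],
    "address": ["12 High Street, York, YO1 7EP",
                "3 Mill Lane, Leeds, LS1 4AB"],
})

# Split the single column into street, city and postcode columns.
# This assumes a consistent "street, city, postcode" layout; real addresses
# usually need cleaning before a simple split like this is reliable.
parts = responses["address"].str.split(", ", expand=True)
parts.columns = ["street", "city", "postcode"]

tidy_responses = pd.concat([responses[["respondent"]], parts], axis=1)
print(tidy_responses)
```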

First steps

Read Hadley Wickham’s article on Tidy data. Using a spreadsheet, could you apply some of these principles to your data from the beginning of a project, or during a project?

Intermediate steps - learn ways to clean and wrangle your data

To take datasets from outside sources or from inexact processes such as web scraping and prepare them for analysis along with your tidy data, you may need to:

  • convert the format or structure of the data
  • remove duplicate or irrelevant observations
  • standardise the format of observations
  • change naming conventions
  • correct errors
  • or perform many more adjustments to make a dataset suitable for your processes (a short sketch of a few of these steps follows this list).
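
A minimal pandas sketch of a few of these steps, assuming invented sample values and assuming that the out-of-range pH reading is a simple data-entry error:

```python
import pandas as pd

raw = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S3"],
    "site":      ["york", "Leeds ", "Leeds ", "YORK"],
    "ph":        [7.2, 6.8, 6.8, 71.0],   # 71.0 looks like a misplaced decimal point
})

cleaned = (
    raw
    .drop_duplicates()                               # remove duplicate observations
    .assign(site=lambda d: d["site"].str.strip()     # standardise the format of
                                    .str.title())    # the site names
)

# Correct an obvious recording error (assumed here to be a misplaced decimal point).
mask = cleaned["ph"] > 14
cleaned.loc[mask, "ph"] = cleaned.loc[mask, "ph"] / 10

print(cleaned)
```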

The terms “data cleaning” and “data wrangling” are often used interchangeably, but they can be defined separately as follows (a short code sketch contrasting the two appears after the list):

  • Data wrangling: transforming and mapping a dataset
  • Data cleaning: removing or correcting data
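
Continuing in the same hedged style (the values and the choice of corrections are assumptions), the distinction might look like this in code: cleaning removes or corrects values, while wrangling transforms and maps the dataset into another structure.

```python
import pandas as pd

df = pd.DataFrame({
    "id":      [1, 2, 3],
    "country": ["UK", "U.K.", None],
    "score":   [0.82, 0.91, 0.77],
})

# Data cleaning: remove or correct data.
cleaned = df.dropna(subset=["country"]).copy()
cleaned["country"] = cleaned["country"].replace({"U.K.": "UK"})

# Data wrangling: transform and map the dataset into the structure you need,
# here from a wide layout to a long "variable/value" layout.
wrangled = cleaned.melt(id_vars="id", var_name="variable", value_name="value")
print(wrangled)
```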

While small datasets could be altered manually, large-scale datasets will require specialised functions or software to process many values at once.

Learn to apply some tidy data methods with this online module using the open source data cleaning software OpenRefine.

Data Quality and Relational Datasets

Every day, more and more data is generated and collected, much of which is suitable for some form of analysis. Depending on the nature of your research, you might have data from multiple sources held as separate data tables. Working with multiple data tables in this way is known as working with relational data.

This is because relations exist between certain variables in a pair of tables, not just within the individual datasets. It is important to note that although you may have any number of data tables, the relations are defined between pairs of tables first and foremost.

Image: A generic example of the relational data table concept
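
As a hedged sketch of the concept (the tables, key column and values are invented), two related tables can be combined through the variable they share, for example with a pandas merge:

```python
import pandas as pd

# Two separate data tables that share the "patient_id" variable.
patients = pd.DataFrame({
    "patient_id": ["P01", "P02", "P03"],
    "year_of_birth": [1980, 1992, 1975],
})
visits = pd.DataFrame({
    "patient_id": ["P01", "P01", "P03"],
    "visit_date": ["2023-01-10", "2023-06-02", "2023-03-15"],
})

# The relation is defined between this pair of tables through the shared key,
# so the two can be combined when an analysis needs variables from both.
combined = patients.merge(visits, on="patient_id", how="inner")
print(combined)
```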

Data quality, especially within the context of relational databases, can take on a set of diverse definitions. An often-cited framework is Total Data Quality Management (TDQM) theory (Wang & Strong, 1996).

The TDQM data quality categories and their dimensions are summarised below (a small illustrative check of one dimension follows):

  • Intrinsic data quality: Believability, Accuracy, Objectivity, Reputation
  • Contextual data quality: Value-added, Relevancy, Timeliness, Completeness, Appropriate amount of data
  • Representational data quality: Interpretability, Ease of understanding, Representational consistency, Concise representation
  • Accessibility data quality: Accessibility, Access security
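
TDQM is a conceptual framework rather than a piece of software, but some of its dimensions can be approximated with simple checks. As an illustrative assumption (invented columns and values), completeness, one of the contextual dimensions, might be profiled per column like this:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["P01", "P02", "P03", "P04"],
    "postcode":   ["YO1 7EP", None, "LS1 4AB", None],
    "ph":         [7.2, 6.8, None, 7.0],
})

# Completeness per column: the proportion of cells that are not missing.
completeness = df.notna().mean()
print(completeness)
# patient_id    1.00
# postcode      0.50
# ph            0.75
```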

Further reading on the TDQM theory is available at:

Wang, R. Y. (1998). A Product Perspective on Total Data Quality Management. Communications of the ACM, [online] 41(2). Available at: http://web.mit.edu/tdqm/www/tdqmpub/WangCACMFeb98.pdf.

Advanced steps - learn to tidy data with code