Why long data eats wide data for lunch

Posted by: Gitahi Ng'ang'a

Most datasets have one obvious unit of analysis. For example, if you’re collecting data on patients, then the unit of analysis is the patient. Assuming each patient has the variables Name, Sex and Age, a sample dataset with 3 cases might be tabulated as follows.

Name   Sex     Age
Alice  Female  30
Bob    Male    40
Clara  Female  20

In most statistical software, this layout readily lends itself to much of the analysis you would perform to answer your research questions.

Easy to analyze

Consider the 3 summary statistics below.

  1. Mean age
  2. Proportion of males
  3. Median age by sex

Computing the mean age is as simple as running an aggregation function on the Age column. Similarly, both the proportion of males and the median age by sex can be easily computed by setting up a pivot table in MS Excel or similar software.
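For readers who work in code rather than a spreadsheet, here is a minimal sketch of the same three statistics using only Python's standard library. The variable names (`patients`, `ages_by_sex` and so on) are illustrative and not tied to any particular tool.

```python
from collections import defaultdict
from statistics import mean, median

# The sample dataset: one row per patient.
patients = [
    {"Name": "Alice", "Sex": "Female", "Age": 30},
    {"Name": "Bob", "Sex": "Male", "Age": 40},
    {"Name": "Clara", "Sex": "Female", "Age": 20},
]

# 1. Mean age: a single aggregation over the Age column.
mean_age = mean(p["Age"] for p in patients)

# 2. Proportion of males: count the matching rows, divide by the total.
proportion_male = sum(p["Sex"] == "Male" for p in patients) / len(patients)

# 3. Median age by sex: group the ages by sex, then aggregate each group.
ages_by_sex = defaultdict(list)
for p in patients:
    ages_by_sex[p["Sex"]].append(p["Age"])
median_age_by_sex = {sex: median(ages) for sex, ages in ages_by_sex.items()}

print(mean_age, proportion_male, median_age_by_sex)
```

Each statistic needs only one column, or one column grouped by another, which is what makes this layout so convenient.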

Sometimes, however, a dataset contains other possible units of analysis besides the primary unit. Consider the example below.

Suppose you are collecting data on clinics at a hospital. Each clinic has the attributes Specialty and Number of Staff. Further, each clinic is also associated with a list of patients.

The wide format

Now, assuming 2 clinics, with the first having 2 patients and the second one 1 patient, a tabulation of the dataset might look like this:

Specialty  Staff  Name1  Sex1    Age1  Name2  Sex2  Age2
TB         5      Alice  Female  30    Bob    Male  40
HIV        7      Clara  Female  20

This layout is often referred to as the wide format. It is called the wide format because it grows horizontally to accommodate the clinic with the most patients.

In this dataset, the primary unit of analysis is the clinic. You might, for example, calculate the median number of staff. However, it is also possible to perform further analysis with the patient as the unit of analysis, for example by calculating the mean patient age.

Now, notice that while it is straightforward to analyze this dataset with the clinic as the unit of analysis, using the patient as the unit of analysis is much more complicated for the following reasons:

  1. It is not possible to apply a simple aggregation formula to the Age variable since it is now spread out across several columns.
  2. The number of columns in the dataset is not fixed. It grows with the maximum number of patients in a clinic.
  3. Not all cells have data. Clinics with fewer than the maximum number of patients invariably contain blank cells.
  4. Since repeated columns rely on indexed names, e.g. Age1, Age2, Age3, etc., errors in column headers can be difficult to detect and fix.
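To make these difficulties concrete, here is a rough Python sketch of what computing the mean patient age looks like against the wide table. It assumes blank cells are stored as `None` and that the age columns follow the `Age1`, `Age2`, ... naming pattern exactly. Both are assumptions made for illustration.

```python
import re
from statistics import mean

# The wide table: one row per clinic, patient columns repeated with an index.
wide = [
    {"Specialty": "TB", "Staff": 5,
     "Name1": "Alice", "Sex1": "Female", "Age1": 30,
     "Name2": "Bob", "Sex2": "Male", "Age2": 40},
    {"Specialty": "HIV", "Staff": 7,
     "Name1": "Clara", "Sex1": "Female", "Age1": 20,
     "Name2": None, "Sex2": None, "Age2": None},  # blank cells
]

# The age columns have to be discovered by name, since their number is not fixed.
age_column = re.compile(r"Age\d+")

# Collect every age from every matching column, skipping the blanks.
ages = [
    value
    for row in wide
    for column, value in row.items()
    if age_column.fullmatch(column) and value is not None
]

mean_patient_age = mean(ages)
print(mean_patient_age)
```

Note that a mistyped header such as `Agee2` would simply be skipped by the pattern, so the result would be wrong without any error being raised.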

In spite of these limitations, datasets organized in this format are widely regarded as easier to understand (pun intended). The reason is that each row constitutes a single entry of the primary unit of analysis and there are no superfluous repetitions.

The long format

The alternative to the wide format is the long format. Tabulating the same clinic data from the example above in the long format produces the following table.

Specialty  Staff  Name   Sex     Age
TB         5      Alice  Female  30
TB         5      Bob    Male    40
HIV        7      Clara  Female  20

The long format gets its name from the fact that it grows vertically (rather than horizontally) to accommodate the clinic with the most patients.

Notice that clinic data is repeated in the rows as many times as there are patients. Also, notice that this format contains a fixed number of columns, eliminating the need for indexed column names.

By simply reorganizing the data this way, analysis on the secondary unit becomes easy again. Every measure resides in its own column, making it straightforward to write aggregate functions and summarize the data in a pivot table.
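As a small illustration in Python (standard library only, with illustrative variable names), patient-level statistics on the long table reduce to reading a single Age column, optionally grouped by a clinic attribute:

```python
from collections import defaultdict
from statistics import mean

# The long table: one row per patient, clinic data repeated on each row.
long_rows = [
    {"Specialty": "TB", "Staff": 5, "Name": "Alice", "Sex": "Female", "Age": 30},
    {"Specialty": "TB", "Staff": 5, "Name": "Bob", "Sex": "Male", "Age": 40},
    {"Specialty": "HIV", "Staff": 7, "Name": "Clara", "Sex": "Female", "Age": 20},
]

# Mean patient age: one aggregation over one column, no blanks to skip.
mean_patient_age = mean(row["Age"] for row in long_rows)

# Mean patient age per clinic specialty: group by one column, aggregate another.
ages_by_specialty = defaultdict(list)
for row in long_rows:
    ages_by_specialty[row["Specialty"]].append(row["Age"])
mean_age_by_specialty = {
    specialty: mean(ages) for specialty, ages in ages_by_specialty.items()
}

print(mean_patient_age, mean_age_by_specialty)
```

One caveat: because clinic data is repeated on every patient row, clinic-level statistics such as the median number of staff should be computed over distinct clinics rather than over all rows.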

Surprisingly, most mobile data collection and analysis applications present their datasets in the wide format. They prioritize the presentation of the entire dataset as a single whole over ease of analysis.

A different approach

At Hoji, we take a different approach. We sacrifice the ability to present the entire dataset in a single table in order to achieve easier analysis. In other words, we present the data for the primary unit of analysis separate from data for secondary units of analysis.

This means that each dataset is presented in the form best suited to its analysis. Importantly, Hoji allows data analysts to include any columns they want from the primary dataset in the secondary dataset. This enables them to still associate, as in the example above, patients with clinics.
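The general idea can be sketched in a few lines of Python. This is not Hoji's implementation, only an illustration of keeping the two datasets separate and copying chosen clinic columns onto patient rows. The `clinic_id` key and the `with_clinic_columns` helper are hypothetical names introduced for this example.

```python
# Primary dataset: one row per clinic, keyed by a hypothetical clinic_id.
clinics = {
    1: {"Specialty": "TB", "Staff": 5},
    2: {"Specialty": "HIV", "Staff": 7},
}

# Secondary dataset: one row per patient, pointing back to its clinic.
patients = [
    {"clinic_id": 1, "Name": "Alice", "Sex": "Female", "Age": 30},
    {"clinic_id": 1, "Name": "Bob", "Sex": "Male", "Age": 40},
    {"clinic_id": 2, "Name": "Clara", "Sex": "Female", "Age": 20},
]


def with_clinic_columns(patients, clinics, columns):
    """Copy the chosen clinic columns onto each patient row."""
    enriched = []
    for patient in patients:
        clinic = clinics[patient["clinic_id"]]
        extra = {column: clinic[column] for column in columns}
        enriched.append({**extra, **patient})
    return enriched


# Bring only Specialty into the patient dataset; Staff stays with the clinics.
patients_with_specialty = with_clinic_columns(patients, clinics, ["Specialty"])
for row in patients_with_specialty:
    print(row)
```

The analyst decides which clinic columns travel with the patients, so the patient dataset stays as narrow as the analysis allows.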

Personally, I have not found any application where the wide format is superior to the long format. In any case, converting from the long to the wide format is easier in most software applications than converting the other way round.
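To give a feel for the long-to-wide direction, here is a minimal Python sketch. The `long_to_wide` function is a hypothetical helper, and it assumes that a clinic is uniquely identified by the combination of its clinic columns, which may not hold for real data.

```python
from collections import defaultdict


def long_to_wide(rows, clinic_columns, patient_columns):
    """Collapse patient rows into one row per clinic with indexed columns."""
    # Group patient rows by their clinic values (assumed to identify a clinic).
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in clinic_columns)
        grouped[key].append(row)

    wide = []
    for key, members in grouped.items():
        wide_row = dict(zip(clinic_columns, key))
        # Number each patient's columns: Name1, Sex1, Age1, Name2, ...
        for index, member in enumerate(members, start=1):
            for column in patient_columns:
                wide_row[f"{column}{index}"] = member[column]
        wide.append(wide_row)
    return wide


long_rows = [
    {"Specialty": "TB", "Staff": 5, "Name": "Alice", "Sex": "Female", "Age": 30},
    {"Specialty": "TB", "Staff": 5, "Name": "Bob", "Sex": "Male", "Age": 40},
    {"Specialty": "HIV", "Staff": 7, "Name": "Clara", "Sex": "Female", "Age": 20},
]

wide_rows = long_to_wide(long_rows, ["Specialty", "Staff"], ["Name", "Sex", "Age"])
for row in wide_rows:
    print(row)
```

In this sketch, clinics with fewer patients simply lack the higher-numbered columns, which corresponds to the blank cells in the wide table.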

Let me know what you think in the comments section below. Have a great week!

4 Responses

  1. Koros Gilbert says:

    I prefer the long format approach

  2. Daniel Mwanga says:

    I have always preferred long format.. esp during analysis…always easier to deal with

  3. Abdirashid Adaw says:

    I prefer long format as it’s easy during analysis

  4. Isaac says:

    Thank you for sharing the approaches to organizing data. Personally, I prefer using the long format. It makes aggregation of data easy, and it makes computing basic statistical analyses equally easy. Mean, mode, median, and range are easy to calculate in the long format.
