The labelled_spss_survey class • retroharmonize

Large international survey programs that give access to their data, such as the aforementioned Eurobarometer, Afrobarometer and Arab Barometer, the European Values Survey, or Latinobarómetro, usually distribute the data as SPSS files. In some cases, Stata files are available, too. Our examples always start with SPSS files.

In R, the haven package provides functions to import and export data to and from IBM’s proprietary SPSS files. SPSS files contain the survey data in a coded form, with the coding (labelling) metadata optionally attached to each variable. We found that while haven::read_spss works perfectly when importing the data and metadata of a single survey, it is not always suitable for multiple survey files, for two important reasons.

First, variables imported with inconsistent labelling cannot always be concatenated. For example, an unlabelled age variable is imported as an R numeric vector, whereas when the age 18 is labelled 18 years or 18 éves, the variable is imported as a labelled class.
Second, SPSS variables do not handle the various kinds of missing cases in a complete and unambiguous form. In an age variable, 998 and 999 may be labelled as not asked and declined to answer, or the whole numerical range 120-999 may simply be marked as representing missing cases.
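The ambiguity around missing cases can be sketched in base R. The age values, the explicit codes and the range below are illustrative assumptions, not data from any of the surveys mentioned; the point is only that a code and a range can both mark the same observations as missing.

```r
# Hypothetical age variable with user-defined missing codes (illustrative data)
age <- c(25, 47, 998, 999, 18)

# Two ways the same SPSS file might declare missing cases
na_values <- c(998, 999)   # explicit missing codes
na_range  <- c(120, 999)   # a numeric range that represents missing cases

# A value is missing if it matches a code or falls inside the range
is_missing <- age %in% na_values |
  (age >= na_range[1] & age <= na_range[2])

age_clean <- ifelse(is_missing, NA_real_, age)

mean(age)                      # distorted by the codes 998 and 999
mean(age_clean, na.rm = TRUE)  # uses only the three valid ages
```

If the missing-case metadata is ignored, the special codes enter the average as if they were real ages.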

One practical problem in the Eurobarometer surveys, which target Europeans aged 15 and over, is that in some standard questions, such as the age of finishing full-time education, 10 represents a special missing value, whilst in other surveys it may be a perfectly valid numerical answer. The real coding problem is that SPSS users can freely choose between labelling missing cases explicitly, declaring a numerical range for missing cases, or not providing any missing case metadata at all.
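A minimal base R sketch of this situation, assuming two hypothetical surveys (`survey_a`, `survey_b`) and made-up answers: the value 10 is a special code in one survey and a real answer in the other, so the missing-value rule has to be looked up per survey.

```r
# Hypothetical missing-value codes per survey (names and codes are assumptions)
missing_codes <- list(
  survey_a = c(10),        # 10 is a special missing code in this survey
  survey_b = numeric(0)    # no missing-case metadata: 10 is a real answer
)

answers <- data.frame(
  survey  = c("survey_a", "survey_a", "survey_a",
              "survey_b", "survey_b", "survey_b"),
  edu_age = c(18, 10, 22, 10, 16, 19),
  stringsAsFactors = FALSE
)

# Check each answer against the codes declared for its own survey
is_missing <- unname(mapply(
  function(s, v) v %in% missing_codes[[s]],
  answers$survey, answers$edu_age
))

answers$edu_age_clean <- ifelse(is_missing, NA_real_, answers$edu_age)
answers
```

Applying a single global rule would either drop the valid 10 in `survey_b` or keep the special code in `survey_a`.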

Our importing functions rely on two new S3 classes. A single survey is imported into the survey() class, which inherits all the properties of a modern data frame (it is a tibble, or tbl_df, from the tibble package in the tidyverse) but retains as much metadata from the original SPSS file as possible. These metadata attributes are stored so that they facilitate proper documentation and a reproducible workflow. Furthermore, the import converts labelled variables into the retroharmonize_labelled_spss_survey class, which inherits from haven::labelled_spss and handles missing value ranges and labels more consistently. See ?labelled_spss_survey.

Working with our new retroharmonize_labelled_spss_survey class can be cumbersome for simple harmonization tasks. When harmonizing a single question from two surveys, it may not be practical, and a simple crosswalk table can be enough to spot and correct inconsistent codes.
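As an illustration of the crosswalk idea, the following base R sketch maps the original codes of one question in two hypothetical surveys to a common label. The column names (`id`, `value_orig`, `label_harmonized`) and the codes are assumptions made for this example, not a format prescribed by the package.

```r
# Hypothetical crosswalk: one row per survey and original code
crosswalk <- data.frame(
  id               = c("survey1", "survey1", "survey1",
                       "survey2", "survey2", "survey2"),
  value_orig       = c(1, 0, 8, 1, 2, 9),
  label_harmonized = c("yes", "no", "declined",
                       "yes", "no", "declined"),
  stringsAsFactors = FALSE
)

# Illustrative answers from both surveys, still in their original coding
answers <- data.frame(
  id         = c("survey1", "survey1", "survey2", "survey2", "survey2"),
  value_orig = c(1, 8, 2, 9, 1),
  stringsAsFactors = FALSE
)

# Look up the harmonized label by survey id and original code
lookup <- setNames(
  crosswalk$label_harmonized,
  paste(crosswalk$id, crosswalk$value_orig)
)
answers$harmonized <- unname(lookup[paste(answers$id, answers$value_orig)])
answers
```

Codes that have no row in the crosswalk come back as NA, which makes inconsistent or undocumented codes easy to spot.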

Our survey objects or retroharmonize_labelled_spss_survey vectors can be converted to base R classes with the as_numeric(), as_factor() or as_character() methods. When computing a numerical average, the special age value of 10 is converted to NA_real_ in the numeric representation. In other statistical applications, missing and special values are best represented as categories, which calls for the factor representation. The character representation is often more useful for visualizing the data than the factor representation.
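The three representations can be approximated in base R to show what each one keeps. This is a sketch built on plain vectors under the assumption of a yes/no/declined question; it is not how the package's methods are implemented.

```r
# Coded answers with value labels and one user-defined missing code
values    <- c(1, 1, 0, 8, 8, 8)
labels    <- c(yes = 1, no = 0, declined = 8)
na_values <- 8

# Numeric representation: user-defined missing codes become NA
as_num <- ifelse(values %in% na_values, NA_real_, values)

# Factor representation: labels become levels, ordered by code,
# and the missing category stays visible as a level
labels_sorted <- labels[order(labels)]
as_fct <- factor(values,
                 levels = unname(labels_sorted),
                 labels = names(labels_sorted))

# Character representation: the label text of each answer
as_chr <- as.character(as_fct)

mean(as_num, na.rm = TRUE)
table(as_fct)
as_chr
```

The numeric form suits averages, while the factor and character forms keep "declined" as a category that can be counted or plotted.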

Create a labelled_spss_survey vector

library(retroharmonize)

Use the labelled_spss_survey() helper function to create vectors of class retroharmonize_labelled_spss_survey.

sl1 <- labelled_spss_survey(
  x = c(1, 1, 0, 8, 8, 8),
  labels = c(
    "yes" = 1,
    "no" = 0,
    "declined" = 8
  ),
  label = "Do you agree?",
  na_values = 8,
  id = "survey1"
)

print(sl1)
#> [1] 1 1 0 8 8 8
#> attr(,"labels")
#>      yes       no declined 
#>        1        0        8 
#> attr(,"label")
#> [1] "Do you agree?"
#> attr(,"na_values")
#> [1] 8
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"                
#> [3] "haven_labelled"                     
#> attr(,"survey1_name")
#> [1] "c(1, 1, 0, 8, 8, 8)"
#> attr(,"survey1_values")
#> 0 1 8 
#> 0 1 8 
#> attr(,"survey1_label")
#> [1] "Do you agree?"
#> attr(,"survey1_labels")
#>      yes       no declined 
#>        1        0        8 
#> attr(,"survey1_na_values")
#> [1] 8
#> attr(,"id")
#> [1] "survey1"

You can check the type:

is.labelled_spss_survey(sl1)
#> [1] TRUE

The labelled_spss_survey() class inherits some properties from haven::labelled(), so its vectors can be manipulated with the labelled package. (See particularly the vignette Introduction to labelled by Joseph Larmarange.)

haven::is.labelled(sl1)
#> [1] TRUE

labelled::val_labels(sl1)
#>      yes       no declined 
#>        1        0        8

labelled::na_values(sl1)
#> [1] 8

It can also be subsetted:

sl1[3:4]
#> [1] 0 8
#> attr(,"labels")
#>      yes       no declined 
#>        1        0        8 
#> attr(,"label")
#> [1] "Do you agree?"
#> attr(,"na_values")
#> [1] 8
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"                
#> [3] "haven_labelled"                     
#> attr(,"survey1_name")
#> [1] "c(1, 1, 0, 8, 8, 8)"
#> attr(,"survey1_values")
#> 0 1 8 
#> 0 1 8 
#> attr(,"survey1_label")
#> [1] "Do you agree?"
#> attr(,"survey1_labels")
#>      yes       no declined 
#>        1        0        8 
#> attr(,"survey1_na_values")
#> [1] 8
#> attr(,"id")
#> [1] "survey1"

When used within the modernized version of data.frame, tibble::tibble(), the summary of the variable content prints in an informative way.

df <- tibble::tibble(v1 = sl1)
## Use tibble instead of data.frame(v1=sl1) ...
print(df)
#> # A tibble: 6 × 1
#>   v1               
#>   <retroh_dbl>     
#> 1 1 [yes]          
#> 2 1 [yes]          
#> 3 0 [no]           
#> 4 8 (NA) [declined]
#> 5 8 (NA) [declined]
#> 6 8 (NA) [declined]
## ... which inherits the methods of a data.frame
subset(df, v1 == 1)
#> # A tibble: 2 × 1
#>   v1          
#>   <retroh_dbl>
#> 1 1 [yes]     
#> 2 1 [yes]

Coercion rules and type casting

To avoid any confusion with mis-labelled surveys, combining a labelled_spss_survey vector with a double or integer vector results in a plain double or integer vector. The use of vctrs::vec_c is generally safer than base R c().

# double
c(sl1, 1 / 7)
#> [1] 1.0000000 1.0000000 0.0000000 8.0000000 8.0000000 8.0000000 0.1428571
vctrs::vec_c(sl1, 1 / 7)
#> [1] 1.0000000 1.0000000 0.0000000 8.0000000 8.0000000 8.0000000 0.1428571

c(sl1, 1:3)
#> [1] 1 1 0 8 8 8 1 2 3
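A base R sketch of why a plain vector is a safe outcome: on an ordinary vector that only carries metadata as attributes, c() discards every attribute except names, so the labels could not be trusted after combining anyway. The example uses an unclassed vector with hypothetical attributes, not the package's class.

```r
# A plain double vector that carries labelling metadata as attributes
x <- structure(
  c(1, 1, 0, 8),
  labels    = c(yes = 1, no = 0, declined = 8),
  na_values = 8
)

combined <- c(x, 1 / 7)

typeof(combined)      # still a double vector
attributes(combined)  # the labelling metadata is gone
```

After combining, nothing marks 8 as a missing value any more, so any later conversion to NA has to happen before the vectors are joined.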

Conversion to character works as expected:

as.character(sl1)
#> [1] "1" "1" "0" "8" "8" "8"

The base as.factor() ignores the value labels and uses the numeric codes as levels, because base R factors are integers with a levels attribute.

as.factor(sl1)
#> [1] 1 1 0 8 8 8
#> Levels: 0 1 8

Conversion to factor with as_factor converts the value labels to factor levels:

as_factor(sl1)
#> [1] yes      yes      no       declined declined declined
#> attr(,"label")
#> [1] Do you agree?
#> Levels: no yes declined

Similarly, when converting to numeric types, the user-defined missing values have to be converted to the NA values used in the R language. The base as.numeric() keeps the code 8 as a number; for numerical analysis, convert with as_numeric.

as.numeric(sl1)
#> [1] 1 1 0 8 8 8
as_numeric(sl1)
#> [1]  1  1  0 NA NA NA

Arithmetic

Only a few arithmetic methods are implemented. They treat user-defined missing values as missing rather than as numbers; the median, for example, is computed correctly because these values are removed from the calculation. The implemented methods include:

median()

median(as.numeric(sl1))
#> [1] 4.5
median(sl1)
#> [1] 1

quantile()

quantile(as.numeric(sl1), 0.9)
#> 90% 
#>   8
quantile(sl1, 0.9)
#> 90% 
#>   1

mean()

mean(as.numeric(sl1))
#> [1] 4.333333
mean(sl1)
#> [1] NA
mean(sl1, na.rm = TRUE)
#> [1] 0.6666667

weighted.mean() - always removes NA values. The weights below are random draws without a fixed seed, so the exact results differ between runs.

weights1 <- runif(n = 6, min = 0, max = 1)
weighted.mean(as.numeric(sl1), weights1)
#> [1] 2.777608
weighted.mean(sl1, weights1)
#> [1] 0.603679
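For a reproducible illustration, the same idea can be sketched in base R with fixed weights. The weights are hypothetical and chosen only to keep the arithmetic easy to follow; note that a missing answer has to be dropped together with its weight.

```r
values    <- c(1, 1, 0, 8, 8, 8)
na_values <- 8

# Hypothetical fixed weights instead of random draws
w <- c(0.5, 0.25, 0.25, 1, 1, 1)

# Keep only valid answers and their weights
keep <- !values %in% na_values

weighted.mean(values, w)              # treats the code 8 as a real answer
weighted.mean(values[keep], w[keep])  # valid answers only
```

With these weights the valid answers give (0.5 + 0.25 + 0) / 1 = 0.75, whereas the naive calculation is dominated by the three 8s.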

sum()

sum(as.numeric(sl1))
#> [1] 26
sum(sl1, na.rm = TRUE)
#> [1] 2
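The paired results above can be reproduced in base R by dropping the user-defined missing code by hand. This is only a cross-check on plain vectors, not the package's implementation.

```r
values    <- c(1, 1, 0, 8, 8, 8)
na_values <- 8

# Valid answers only: the three 8s are user-defined missing values
valid <- values[!values %in% na_values]

median(values)  # 4.5, the code 8 counted as a number
median(valid)   # 1

mean(values)    # 26 / 6
mean(valid)     # 2 / 3

sum(values)     # 26
sum(valid)      # 2
```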

The result of the conversion to numeric can be used with other mathematical or statistical functions.

as_numeric(sl1)
#> [1]  1  1  0 NA NA NA
min(as_numeric(sl1))
#> [1] NA
min(as_numeric(sl1), na.rm = TRUE)
#> [1] 0