Harmonized Cultural Access & Participation DatasetSource:
library(retroharmonize) cap_files <- c('ZA4529_v3-0-1.sav', 'ZA5688_v6-0-0.sav', 'ZA6925_v1-0-0.sav')
European Commission (2012). Eurobarometer 67.1 (Feb-Mar 2007). GESIS Data Archive, Cologne. ZA4529 Data file Version 3.0.1, https://doi.org/10.4232/1.10983.
European Commission, Brussels (2016). Eurobarometer 79.2 (2013). GESIS Data Archive, Cologne. ZA5688 Data file Version 6.0.0, https://doi.org/10.4232/1.12577.
European Commission and European Parliament, Brussels (2018). Eurobarometer 88.1 (2017). GESIS Data Archive, Cologne. ZA6925 Data file Version 1.0.0, https://doi.org/10.4232/1.12959.
retroharmonize project is not affiliated with GESIS.
## This is a dummy path. You should use the path where you saved these files. gesis_dir <- file.path("C:", "your_data", "gesis_files")
## Create full paths to the three selected files cap_files <- file.path(gesis_dir, cap_files)
document_cap_files <- document_surveys(survey_path = cap_files, .f = "read_spss")
The files are rather large. We have
sum(document_cap_files$ncol) variables. We will only harmonize a few of them.
It is likely that the surveys will fit into your computer’s memory, so you can omit the first step. If you work with many files, and you want to keep working sequentially with survey files, it is a good idea to convert them to R objects.
cap_metadata <- metadata_create(survey_files = cap_files)
weigthing_crosswalk_table <- crosswalk_table_create(weight_variables) %>% select ( -all_of(c("val_numeric_orig", "val_numeric_target", "val_label_orig", "val_label_target")) ) %>% mutate ( var_name_target = case_when ( .data$var_name_target %in% c("wex", "wextra", "v47") ~ 'wex', .data$var_name_orig == "rowid" ~ 'rowid', # do not forget to keep the unique row IDs TRUE ~ "geo"), class_target = ifelse(.data$var_name_target %in% c("geo", "v47"), "factor", "numeric") )
The crosswalk table contains the original (source) variable names that must be converted to the target variable names, i.e.
The Eurobarometer surveys contains separate samples for the former West Germany (
DE-W), East Germany (
DE-E), Northern Ireland (
GB-NIR), and Great Britain, i.e. England, Scotland and Wales (
GB-GBN). For the variable
geo the appropriate post-stratification weight is the
w1 variable. For the two German subsamples, it is
w_de, and for the two United Kingdom subsamples it is the
w_uk. We create two new variables, the
w. When the two UK and the two German samples are joined, the appropriate post-stratification weight is
weight_vars <- weight_vars %>% mutate ( country_code = substr(geo, 1,2), w = case_when ( country_code == "DE" ~ w_de, country_code == "GB" ~ w_uk, TRUE ~ w1 )) %>% mutate ( year_survey = case_when( .data$id == "ZA4529_v3-0-1" ~ '2007', .data$id == "ZA5688_v6-0-0" ~ '2013', .data$id == "ZA6925_v1-0-0" ~ '2017' )) %>% mutate ( year_survey = as.factor(.data$year_survey))
We will use two important demography variables: the age of the respondent, and the age of school leaving of the respondent. This latter,
age_education variable is an important, ex ante harmonized variable in Eurobarometer. Because each European country has different education systems, furthermore, because education systems are not the same across different demographic groups (for example, people schooled before the World War II often went to very different schools), a more precise education level would require a very complicated mapping with education-specific knowledge.
The age_education variable has three special values: 1. the value for declined answers, which should be coded in a numeric representation to the special
NA value or R;
An integer outside the valid range of responses, which means that the person is still tudying. We will recode the still studying answers to 0, and later we will further recode them to the current age of the person. We will also mark this special group as students—in their case, the
age_educationhas a different meaning. A 17 year-old respondent in Europe is likely to study further; a middle aged person who replied with 17 has a low education level according to the current norms. In this case, a student’s response of 17 is not the same as the middle aged person’s;
And at last, the special value
No formal educationmeans that the person did not finish the primary school. Different countries define differently the minimum mandatory age when a person can leave the school system. We code these answers to 14, which is lower than the lowest age in our sample (15). This is also a special group, so we will mark them with a dummy variable, too.
demography_crosstable <- eb_metadata %>% filter ( .data$var_name_orig == "rowid" | .data$var_label_orig %in% c("age_exact", "age_education")) %>% crosswalk_table_create() %>% mutate ( var_name_target = case_when( # Do not leave out the unique row identifiers! .data$var_name_orig == "rowid" ~ "rowid", TRUE ~ .data$var_label_orig )) %>% mutate ( na_label_target = case_when( .data$na_label_orig %in% c("DK", "Refusal") ~ "Declined", TRUE ~ .data$na_label_target )) %>% mutate ( val_numeric_target = case_when( .data$val_label_orig == "No full-time education" ~ 14, .data$val_label_orig == "Still studying" ~ 0, TRUE ~ .data$val_numeric_target )) %>% mutate ( na_numeric_target = case_when ( .data$na_label_target == "Declined" ~ 99999, TRUE ~ NA_real_ )) %>% mutate ( class_target = case_when ( .data$var_name_target == "rowid" ~ 'character', TRUE ~ "numeric" ))
Our variables of interest are visiting frequencies to concerts and public libraries. In the Eurobarometer CAP surveys, the answers have
Never in the last twelve monthsor
Nonewill be coded as 0 visits, and their factor label will be harmonized to
never. In this case, the questionnaire reveals that whilst this was an ex ante harmonized question, the answers were not consistently labelled in the three files. We can safely bring these responses to the same value
1-2 timeswill be coded to the programatically easier to use
1_2_times, and it will get a numeric value of 1.5. Later, we will try to establish a better numeric value, now it is important that all responses labelled as
1-2 timesshould have the same numeric code.
3-5 timeswill be coded to the numeric value of 3.5 and 3_5_times.
More than 5 timeswill be
more_than_5and have the numeric value of 6.
- The special case of
DKwill be coded as
declined(to anwer) and the Inap… lables to
inap. These are special codes in Eurobarometer meaning that the item is inappropriate, i.e. due to some filtering, this particular question was not asked from the respondant. While respondents who consciously decline an answer usually form a rather homogenous category of their own, the ianppropriate answers are truly missing answers. Their numeric representation will be both
NA(f.e. they should be omitted from a numeric average.)
search_term_labels <- c("rowid|cultural_activities_public_library|cultural_activities_museums_galleries|cultural_activities_concert|cultural_activities_freq_concert| cultural_activities_freq_public_library|cultural_activities_freq_library_or_archive|cultural_activities_freq_museum_gallery") cap_vars_crosstable <- eb_metadata %>% filter ( .data$var_name_orig == "rowid" | grepl(search_term_labels, .data$var_label_orig) ) %>% crosswalk_table_create () %>% mutate ( var_name_target = case_when( .data$var_name_orig == "rowid" ~ "rowid", grepl("concert", .data$var_label_orig) ~ "visit_concert", grepl("library", .data$var_label_orig) ~ "visit_library", grepl("museum", .data$var_label_orig) ~ "visit_museum", TRUE ~ .data$var_label_orig )) %>% mutate ( na_label_target = case_when( .data$na_label_orig %in% c("DK", "Refusal") ~ "declined", grepl("^Inap", .data$na_label_orig) ~ "inap", .data$val_label_orig == "DK" ~ "declined", TRUE ~ .data$na_label_target )) %>% mutate ( val_label_target = case_when ( grepl("^more", tolower(.data$val_label_orig)) ~ "more_than_5", grepl("never|^not|none", tolower(.data$val_label_orig)) ~ "never", tolower(.data$val_label_orig) == "1-2 times" ~ "1_2_times", tolower(.data$val_label_orig) == "3-5 times" ~ "3_5_times", !is.na(.data$na_label_target) ~ .data$na_label_target, # do not forget the rowid, it should not be labelled .data$var_name_target == "rowid" ~ NA_character_, TRUE ~ "coding_error" )) %>% mutate ( na_numeric_target = case_when ( .data$na_label_target == "declined" ~ 99999, .data$na_label_target == "inap" ~ 99998, TRUE ~ NA_real_ )) %>% mutate ( val_numeric_target = case_when ( .data$val_label_target == "never" ~ 0, .data$val_label_target == "1_2_times" ~ 1.5, .data$val_label_target == "3_5_times" ~ 3.5, .data$val_label_target == "more_than_5" ~ 6, TRUE ~ .data$na_numeric_target ))
We create two version of the crosswalk table. One will rename the vist_ variables to fct_visit, and give them a factor representation.
cap_vars_crosstable_fct <- cap_vars_crosstable %>% mutate ( var_name_target = ifelse( test = grepl("^visit", .data$var_name_target), yes = gsub("visit", "fct_visit", .data$var_name_target), no = .data$var_name_target), class_target = ifelse( test = .data$var_name_target == 'rowid', yes = "character", # the rowid remains a character no = "factor" ) ) cap_vars_fct <- crosswalk( crosswalk_table = cap_vars_crosstable_fct, survey_list = cap)
The other will give these variable a numeric representation.
The average post-stratification weight must be 1 for
w1, and for
w_uk whenever they are not zero values. The
wex variable has no naturally defined mean value.
All concert visits labelled as
inap must be coded to a numeric
NA value. Furthermore, there must not be other
NA values than cases in the
ZA6925_v1-0-0 survey, where this question was not asked. For consistency, these observations should be also coded as
inap in the factor representation.
In case of the
visit_library variable all missing values should have a factor representation of
harmonized_cap_file %>% select ( starts_with(c("id", "fct", "visit")) ) %>% select ( -contains("museum"), -contains("library") ) %>% filter ( is.na(.data$visit_concert) ) %>% distinct_all() harmonized_cap_file %>% select ( starts_with(c("id", "fct", "visit")) ) %>% select ( -contains("museum"), -contains("concert") ) %>% filter ( is.na(.data$visit_library) ) %>% distinct_all()
harmonized_cap_file <- harmonized_cap_file %>% mutate ( fct_visit_museum = ifelse(.data$id == "harmonized_cap_file", "inap", .data$fct_visit_museum)) %>% mutate ( is_visit_concert = ifelse(.data$visit_concert == 0, 0, 1), is_visit_library = ifelse(.data$visit_library == 0, 0, 1), is_visit_museum = ifelse(.data$visit_museum == 0, 0, 1))
You can find this harmonized dataset on Zenodo in the Digital Music Observatory and the Cultural Creative Sectors Industries Data Observatory repositories.
Daniel Antal. (2021). Harmonized Cultural Access & Participation Dataset (Version 20211214) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5781672