The aim of this file is to give data users a detailed overview of the fully harmonised GGP data. Variable names begin with a letter designating the wave of data collection (“a” for the first wave, “b” for the second wave). We have tried to keep variable names the same across waves; variables that are new in a wave are identified as [“wave letter”]n, e.g. bn301. The codebook starts with the constructed variables that summarise the key socio-demographic characteristics of the respondent. The file also documents the consolidated variables and the variables used to build them.
GGS Wave 1 Codebook
All country-specific variables end with an underscore followed by [“country code”]01, e.g. a119_2401 for Australia, whose country code is 24. Country-specific values are at least 4 digits long (F4 format) and begin with the country code, e.g. 2401 for Australia. These country specificities are recorded in the GGS Codebook. You can check which variables are available for which country in the Online data analysis suite. Details about the specific methodologies of each country are available in the data documentation section.
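The naming convention described above can be sketched as a small parser. This is only an illustration, not part of the GGP tooling: the function name `parse_ggs_name`, the `GGSName` tuple and the regular expression are assumptions, and the pattern covers only the forms quoted in the text (a wave letter, an optional "n" marker for new variables, and an optional country suffix ending in 01). Grid subscripts such as ahg3_1 are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class GGSName(NamedTuple):
    """Parts of a variable name (hypothetical structure, for illustration)."""
    wave: str                    # "a" = first wave, "b" = second wave
    is_new: bool                 # True when the name carries the "n" marker
    stem: str                    # remaining identifier, e.g. "119" or "301"
    country_code: Optional[int]  # e.g. 24 for Australia, None if not country-specific


# Assumed pattern: wave letter, optional "n" before a digit, stem,
# optional "_<country code>01" suffix.
_PATTERN = re.compile(
    r"(?P<wave>[ab])"
    r"(?P<new>n(?=\d))?"
    r"(?P<stem>[a-z0-9]+)"
    r"(?:_(?P<cc>\d+?)01)?"
)


def parse_ggs_name(name: str) -> GGSName:
    """Split a variable name into wave, new-variable flag, stem and country code."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised variable name: {name!r}")
    cc = match.group("cc")
    return GGSName(
        wave=match.group("wave"),
        is_new=match.group("new") is not None,
        stem=match.group("stem"),
        country_code=int(cc) if cc else None,
    )
```

For example, `parse_ggs_name("a119_2401")` yields wave "a" with country code 24, and `parse_ggs_name("bn301")` yields wave "b" flagged as a new variable.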
The GGP fieldwork is carried out by the participating countries, usually involving national statistical offices and research institutes. Before submission, national teams go through the data processing procedures, which transform the raw data from field collection into a cleaned and corrected state that can be used for analysis.
The data editing stage, which is conducted by the national teams of the participating countries, is essential for the subsequent data harmonisation and finalisation procedures. National teams are therefore strongly advised to follow the GGS standards set out in the GGS Harmonisation Data File Description, which defines how the submitted data should look. The aim of this file is to give national teams a detailed overview of the data framework: the variable names and the value label codes. It is also the bridge between the questionnaire and the GGS variables. Variable names begin with a letter designating the wave of data collection (“a” for the first wave, “b” for the second wave). We have tried to keep variable names the same across waves; variables that are new in a wave are identified as [“wave letter”]n, e.g. bn301.
The national teams are required to complement their submissions with a data availability report. It is very important that the participating countries present a thorough report, because it gives a detailed overview of the data, i.e. which questions the country implemented, to what degree the implemented questions were answered, and what extra information the data contains.
The data is submitted in a pre-harmonised form, prepared and organised according to the GGS standards. The next important procedure that the data goes through is harmonisation. Harmonisation aims at a clear and comparable format of the GGS micro-data files that is adequate for cross-country comparison. The harmonisation procedure is composed of:
- Label checks
Since variables in a comparative dataset must be consistent across countries, this step makes sure that all variables are named the same across countries and refer to a particular question in the GGS Questionnaire, and that value labels are coded in the same manner for all GGP participating countries.
- Dealing with grids
The GGS questionnaire holds several grids containing either event history information or information on members of the household. The Household grid, for example, captures the key information about the respondent and the members of the respondent's household.
Such data needs to be harmonised with specific attention to the order and logical consistency of the grid rows (whether household members or events such as births). In data terms, each row of the grid is represented by a variable name followed by a subscripted number ("_#"). Each subscript thus represents one household member or one event. Part of grid harmonisation is therefore grid sorting. Sorting ensures that each subscript refers to a particular member of the household and that the grid rows are sorted according to a pre-defined key. In the household grid, household members are sorted according to their relationship to the respondent, i.e. the relation to respondent variable (ahg3_# or bhg3_#). Respondents appear first, followed by their partners and children, if any, and then by other household members. As there may be more than one child (or other relative) living in the household, they also need to be sorted. In the household grid, age is used as the secondary sorting key (from the oldest person to the youngest).
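The two-level sort described above can be sketched as follows. This is a minimal illustration under stated assumptions: the relation labels ("respondent", "partner", "child", "other"), the row layout and the function name `sort_household_grid` are hypothetical and do not reproduce the actual value codes of ahg3_# or bhg3_#. Every row is assumed to have a known age.

```python
# Assumed priority of relations; any relation not listed sorts after children.
RELATION_ORDER = {"respondent": 0, "partner": 1, "child": 2}


def sort_household_grid(rows):
    """Return grid rows keyed by their new subscript (1, 2, 3, ...).

    Primary key: relation to respondent (respondent, partner, child, others).
    Secondary key: age, oldest first.
    """
    def sort_key(row):
        rank = RELATION_ORDER.get(row["relation"], len(RELATION_ORDER))
        return (rank, -row["age"])

    ordered = sorted(rows, key=sort_key)
    return {subscript: row for subscript, row in enumerate(ordered, start=1)}


# Hypothetical household as recorded in the field, in arbitrary order.
grid = [
    {"relation": "child", "age": 4},
    {"relation": "other", "age": 70},
    {"relation": "respondent", "age": 35},
    {"relation": "child", "age": 9},
    {"relation": "partner", "age": 37},
]
sorted_grid = sort_household_grid(grid)
```

After sorting, subscript 1 holds the respondent, subscript 2 the partner, subscripts 3 and 4 the children from oldest to youngest, and subscript 5 the other household member.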
The routing check is a very important step of the harmonisation procedure. It ensures that the structure of the underlying dataset matches the structure of the GGS questionnaire. Its main goal is to code every variable in the dataset as either a valid response, a nonresponse or a skip, as indicated in the questionnaire. Consequently, a skip indicated in the questionnaire is represented by a system missing code (. in Stata, sysmis in SPSS), while information missing for other reasons is coded as non-applicable/no response (i.e. codes 7, 8, 9 in SPSS or .a, .b, .c in Stata). The routing check therefore examines each and every cell of the dataset and compares it to its corresponding definition in the questionnaire. Action is taken on a couple of distinct occasions. A brief decision process is represented in the diagram.
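The per-cell decision can be sketched roughly as below. This is an assumption-laden simplification, not the documented decision process: the function name `routing_check`, the `NO_RESPONSE` placeholder and the rule that a value recorded for a skipped question is overwritten with system missing are all illustrative choices. `None` stands in for the system missing code.

```python
SYSMIS = None                 # stands in for "." in Stata / sysmis in SPSS
NO_RESPONSE = "no_response"   # hypothetical placeholder for a nonresponse code


def routing_check(value, applicable, valid_values):
    """Code one cell as a valid response, a nonresponse or a skip.

    value        -- the recorded value (None when nothing was recorded)
    applicable   -- True if the questionnaire routes this respondent to the question
    valid_values -- the set of valid answer codes for the question
    """
    if not applicable:
        # The questionnaire indicates a skip: the cell becomes system missing.
        return SYSMIS
    if value in valid_values:
        # The respondent was routed to the question and gave a valid answer.
        return value
    # The question applied, but no valid answer is present.
    return NO_RESPONSE
```

For a question with valid codes 1 and 2, a routed respondent with the answer 2 keeps that value, a routed respondent without an answer receives the nonresponse placeholder, and a respondent who was skipped receives system missing.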
The final GGS data harmonisation step is the consolidation of scattered data. The process consolidates information scattered over several variables into a single variable. Scattering of information often occurs due to simplifications in questionnaire routing, i.e. paper and pencil questionnaire adaptations for easier interviewing. The consolidation procedure is carried out in the Children Section, the Partnership Section and the Parents and Parental Home Section.
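Consolidation can be pictured as collapsing several source variables into one. The sketch below assumes that at most one distinct value should be present across the source variables for a respondent and raises an error otherwise; this rule, the function name `consolidate` and the variable names are hypothetical and are not taken from the GGS guidelines.

```python
def consolidate(record, source_variables):
    """Collapse several scattered source variables into a single value.

    Returns the one non-missing value found among the sources, or None
    when all of them are missing. Conflicting values raise ValueError so
    that they can be reviewed manually (an assumed policy).
    """
    found = {record.get(name) for name in source_variables}
    found.discard(None)  # None stands in for a missing value
    if len(found) > 1:
        raise ValueError(f"conflicting values in {source_variables}: {sorted(found)}")
    return found.pop() if found else None


# Hypothetical record where the same question was asked on two routing paths.
record = {"q_path_a": None, "q_path_b": 3}
consolidated = consolidate(record, ["q_path_a", "q_path_b"])
```

Here the consolidated variable takes the value 3, because only the second routing path recorded an answer.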
Income is the variable with the highest rate of missing values. Because of its sensitive nature, respondents are reluctant to share income information with the interviewer. To be able to use income information in a cross-country comparative study without losing too many observations, it is necessary to impute the approximately correct distribution of the income variable in each country.
For a more detailed and technical description of the procedure, please refer to the Data Cleaning and Harmonisation Guidelines.
Data Cleaning and Harmonization Guidelines (288.12 kB 2009-08-23 21:00:36)