1The current system to monitor the health behaviour of the population living in France is based partly on a series of cross-sectional surveys of the general population. One of the oldest and most regular of those surveys is the Health Barometer (Baromètre santé), conducted every five years since 1992. The survey is currently managed by the French public health agency, Santé publique France. 
2The surveys are conducted on the basis of a random sample of telephone numbers. The methodology has evolved over the years to adapt to new technical and administrative constraints resulting from the end of the national telecommunications monopoly and the subsequent diversification of telecoms equipment and usage (Beck et al., 2013; Beck and Guilbert, 2007; Richard et al., 2014).
3For the past 20 years, these self-report surveys have monitored the main risk-taking and health behaviours, attitudes and perceptions in the population living in France (Beck, 2011). The surveys yield many health indicators (DREES-DGS, 2013; DREES, 2015), and the retrospective data can be used in retrospective event history or longitudinal analyses. They have notably been used to document the distribution of tobacco and cannabis smoking in the population (Bricard et al., 2015; Legleye et al., 2011, 2014, 2016; Pampel et al., 2015).
4Many of these data cannot be collected in any other way. They include information on individual behaviours that is not otherwise recorded (e.g. use of illicit substances or the number of alcoholic drinks consumed on a single occasion, because sales figures or household budget breakdowns indicate only the availability of alcohol) and above all on the population’s attitudes, perceptions and opinions.
5However, these cross-sectional telephone surveys offer little opportunity to chart life courses reliably (and within a reasonable timeframe, compatible with the conduct of a survey), because of frequent and socially differentiated recall errors, which can distort the results. It is also extremely difficult to gather accurate information from respondents about their health status (chronic illnesses or certain conditions), healthcare consumption or welfare benefits.
6Moreover, the surveys cannot be used to monitor respondents over time. Useful research could be conducted on mortality, for example, if it were possible to know the causes of death of the participants in each of the surveys. A recent example based on an epidemiological survey conducted in Lorraine in 1996 has shown the potential benefits of matching survey data with mortality data (Khlat et al., 2014). Having access to data on the respondents’ working careers and on their health status between the survey and their death would increase the robustness of analyses.
7The purpose of our work is to provide an initial basis for a discussion of whether cross-sectional survey data of this kind could be matched with administrative, medical and mortality data. Specifically, we aim to measure the differentials in respondents’ sensitivity to the cross-linking of their personal data, and thus to contribute information about the selection biases likely to be induced by a study based only on individuals who agree to provide identifying data (social security number or name and birth details) directly.
8To this end, we conducted an experiment which involved asking respondents to the 2014 Health Barometer whether they would agree to provide identifying information that could be used to obtain administrative data about their personal situation. The study was designed to test the acceptability of two different requests: for their social security number, and for their first name, surname, and date and municipality of birth. Here, we give a brief overview of the Health Barometer method, the experimental protocol and its quantitative results. In particular, we describe the consenters in terms of both their socio-demographic characteristics and their self-rated health in order to test for the existence of biases liable to affect the representativeness of the sample of consenters.
I – State of the art
9The collection of prospective information is the preserve of cohort studies. On many themes, cohort studies are much more informative than cross-sectional surveys, especially for exploring causal relationships. But there are few large cohort studies representative of the general population that record health behaviour; most pursue a more directly medical purpose and target specific populations, or are less concerned with representativeness (Goldberg et al., 2013). Attrition is also a source of bias in cohort studies. In the French version of the Generations and Gender Survey (Étude des relations familiales et intergénérationnelles, ERFI), on an initial sample of 10,079 individuals, attrition was 43% after six years (after two waves of data collection) (Régnier-Loilier and Guisse, 2012). In the health and social protection survey (Enquête santé et protection sociale – SPS), attrition by the second wave, i.e. after four years, was around 55% of the initial sample of some 12,000 individuals (Jusot et al., 2008). In Tempo, attrition was around 60% of the initial 1,103 participants. Statistical adjustments can, of course, be made to offset the impact of attrition, but they greatly reduce accuracy and are impossible when the number of participants is too small. Furthermore, attrition is highly socially selective and may be linked to health status (which is often harder to correct through adjustment), which again raises the question of the extent to which the results can be generalized. Ultimately, while the data gathered through successive surveys of a cohort are invaluable, these studies are expensive and complex to organize, so it is hard to imagine mounting representative cohort studies of sufficient size, specialized in all of the relevant public health domains.
10Panel and cohort studies are now frequently matched with administrative or medical records. For instance, the SPS survey sample is drawn from the registry of beneficiaries of France’s general health insurance scheme and matched with the national healthcare system claims database (SNIIRAM); Constances is matched with data from SNIIRAM and the national pension fund (CNAV); Tempo with SNIIRAM, etc. This matching enhances accuracy and quality because certain items of information no longer need to be collected directly from respondents. In addition, all of those cohort studies are matched with cause-of-death data from CépiDc at INSERM. One of the advantages of matching is that it provides a means to gather information about participants who have dropped out of the sample, and thus to describe and control attrition, at least in part.
11Similarly, many public statistical surveys use social security and tax data, notably to draw the samples. Less commonly, medical administrative data are used to expand the databases; the ten-year health survey of 2003 and the disability and health survey of 2008-2009 were the first to do this (Montaut et al., 2013).
12Could cross-sectional general population surveys outside the domain of public statistics be similarly improved by incorporating information on income, career and health status through matching with administrative, medical and mortality data? The context of these surveys differs from that of public statistical surveys in that they are conducted without a sampling base and by telephone, which reduces survey acceptability, the possibilities of statistical adjustment and the participation rate, thus making matching with external data of this type all the more beneficial.
1 – Social security number and name and birth details
13The French social security number (NIR) theoretically gives access to a range of medical and administrative data:
- data held by the SNIIRAM-PMSI, produced by CNAMTS and by ATIH;
- dates and causes of death (from INSEE and CépiDc-INSERM);
- career data from the national pension fund (CNAV);
- certain tax data.
14Obtaining such information would make it possible to conduct longitudinal analyses of mortality or healthcare consumption, to test questions on health behaviour and healthcare take-up, and to propose statistical adjustment strategies.
15Name and birth details can be used to recover a person’s social security number through a search of the national register (RNIPP) or the national interscheme health insurance register (RNIAM), managed by CNAV, and thus match the survey file with all of the data mentioned above.
2 – Legal context
16In practice, there are administrative and legal barriers to matching these data. To obtain such information in a survey, respondents must be explicitly asked for their informed consent, which means that they must be informed about the consequences of agreeing to provide their data, and the request for consent must be expressed in plain language. Until the legislative changes of 2016, a decree from the Council of State was required to implement the procedure (Article 27 of France’s Data Protection Act 78-17 of 6 January 1978).
17The legal context changed recently with the passage of the Healthcare Modernization Act in January 2016 and the reading of a new bill on the conditions for mining administrative databases (Digital Republic Act 2016-1321 of 7 October 2016). The new laws are intended chiefly to simplify the conditions governing use of social security numbers, which will no longer require, as was the case until now, a decree from the Council of State but only permission from the French data protection agency CNIL, on condition that the social security numbers are encrypted automatically as soon as they are collected. Use of encryption at the data entry stage requires the use of approved software, but that is not an insurmountable challenge for a telephone survey.
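The requirement that identifiers be protected as soon as they are collected can be illustrated with a short sketch. The source does not describe the approved software or its algorithm, so what follows is only an assumption: it uses a keyed hash (HMAC-SHA256), which is strictly speaking pseudonymisation rather than encryption, and the function name `pseudonymise`, the secret key and the all-zero identifier are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Return a keyed hash of an identifier (illustrative, not the approved procedure)."""
    # Remove spaces so that differently formatted entries yield the same token.
    normalised = "".join(identifier.split()).upper()
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key; in practice it would be held securely, outside the survey file.
key = b"hypothetical-secret-held-by-a-trusted-third-party"

# Obviously fake identifier, used only to show the call.
token = pseudonymise("0 00 00 00 000 000 00", key)
print(len(token))
```

Because the same identifier always yields the same token under a given key, two files pseudonymised with that key could in principle be matched without the raw number being stored.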
18Any request for permission to search databases for a non-interventional study or evaluation in the field of health is subject to an evaluation by an expert committee, which decides on the pertinence of the process in relation to the declared purpose, and an opinion from the national institute of health data (INDS), which decides on the public interest of the declared purpose. Nevertheless, the ongoing legislative changes will facilitate secure, anonymous matching with administrative data and clearly call for a discussion of the use of matching in general population surveys. The collection of social security numbers may remain more tightly regulated in terms of security than the collection of name and birth details. Therefore, the discussion should distinguish between the two types of data to be collected.
II – Methods
1 – The 2014 Health Barometer
19The Health Barometer is a cross-sectional telephone survey based on probability sampling. It is not a public statistical survey and is not compulsory. It is based on a two-stage random sample: first, the telephone numbers are generated randomly, and second, each individual is selected randomly from among eligible household members. The 2014 survey was conducted by IPSOS by phone, using computer-assisted telephone interviewing (CATI). Data collection in the field took place from 11 December 2013 to 31 May 2014. Considerable efforts were made to facilitate the interviews: telephone appointments were offered if the person was unavailable; numbers were dialled up to 40 times before they were abandoned; and interviewers were trained in arguments designed to convince the largest number of eligible people to take part. The telephone numbers were deleted at the end of the interviews, so all data collected remained strictly confidential and were analysed solely for statistical purposes.
20The Health Barometer survey design was modified in 2000 to include people with an ex-directory number (Beck and Guilbert, 2007), in 2005 to include people who only have a mobile phone, and in 2010 to include people with fully unbundled lines. Thanks to these new developments, the survey has become more representative and, most importantly, it is now possible to include people with specific characteristics in terms of health behaviours. In 2014, because a section of the population prefers to use a mobile phone, including some of those who also have a landline, two “overlapping” (i.e. not separate) samples were constituted: one surveyed by landline, the other by mobile phone, with no filter on the type of telephone used by the household. The sample comprised a total of 15,635 individuals (7,577 landlines and 8,058 mobiles). The participation rate was 61% for the landline sample and 52% for the mobile sample. The questionnaire interview took 33 minutes to complete on average.
21The data were weighted by the number of eligible individuals and phone lines in the household (the multiple probabilities of inclusion in the two overlapping samples were therefore taken into account by sharing the weightings), and fitted with the most recent national reference data from INSEE at the time of the survey, i.e. the 2012 Labour Force Survey. This fitting took account of gender crossed with age group, region of residence, size of locality of residence, educational level, and living alone or not.
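The two weighting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for the Health Barometer: the design weight is taken to be the number of eligible individuals divided by the number of phone lines, the fitting to reference data is approximated by raking (iterative proportional fitting) on two margins only, and the four respondents and the target totals are hypothetical.

```python
def design_weight(n_eligible, n_phone_lines):
    """Weight proportional to the inverse of an assumed selection probability."""
    return n_eligible / n_phone_lines


def margin(rows, weights, var):
    """Weighted total for each category of one variable."""
    totals = {}
    for row, w in zip(rows, weights):
        totals[row[var]] = totals.get(row[var], 0.0) + w
    return totals


def rake(rows, weights, targets, n_iter=50):
    """Rescale weights, one margin at a time, until they match the target totals."""
    w = list(weights)
    for _ in range(n_iter):
        for var, target in targets.items():
            totals = margin(rows, w, var)
            w = [wi * target[row[var]] / totals[row[var]] for row, wi in zip(rows, w)]
    return w


# Hypothetical respondents (not survey data).
rows = [
    {"sex": "M", "age": "15-24", "eligible": 2, "lines": 1},
    {"sex": "M", "age": "25+", "eligible": 1, "lines": 1},
    {"sex": "F", "age": "15-24", "eligible": 3, "lines": 2},
    {"sex": "F", "age": "25+", "eligible": 3, "lines": 1},
]
# Hypothetical reference margins, both summing to 100.
targets = {"sex": {"M": 48.0, "F": 52.0}, "age": {"15-24": 15.0, "25+": 85.0}}

initial = [design_weight(r["eligible"], r["lines"]) for r in rows]
final_weights = rake(rows, initial, targets)
print(margin(rows, final_weights, "sex"), margin(rows, final_weights, "age"))
```

The survey itself fitted more margins (gender crossed with age group, region, size of locality, educational level, living alone) and shared weights across the two overlapping samples, which this sketch does not attempt to reproduce.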
2 – The experiment: two questions at the end of the questionnaire
22The experiment consisted of a simple randomization of the sample of respondents. At the end of the questionnaire, one-fifth were asked, “For the purposes of a scientific study, would you be willing to give us your social security number?” and another fifth was asked, “For the purposes of a scientific study, would you be willing to give us your first name, last name, place and date of birth?” (the latter data are referred to hereafter as “name and birth details”). The possible responses were: yes, no, don’t know/don’t want to answer. Our experiment therefore concerned a hypothetical case and did not actually gather the information mentioned.
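A simple randomization of this kind can be sketched as below. The allocation mechanism actually used in the CATI software is not described, so the shuffle-and-split approach, the function name `assign_arms` and the seed are assumptions made for illustration; the reported subsample sizes (3,044 and 3,041) are not exactly one-fifth of 15,635, so the real mechanism evidently differed in detail.

```python
import random


def assign_arms(respondent_ids, seed=2014):
    """Shuffle respondents and give each request to one fifth of them."""
    ids = list(respondent_ids)
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    rng.shuffle(ids)
    fifth = len(ids) // 5
    arms = {}
    for position, rid in enumerate(ids):
        if position < fifth:
            arms[rid] = "social_security_number"
        elif position < 2 * fifth:
            arms[rid] = "name_and_birth_details"
        else:
            arms[rid] = "not_asked"
    return arms


# 15,635 is the total sample size reported for the 2014 survey.
arms = assign_arms(range(15635))
counts = {}
for arm in arms.values():
    counts[arm] = counts.get(arm, 0) + 1
print(counts)
```

Because each respondent receives at most one of the two requests, the two groups can be compared directly, but no respondent's reaction to both requests is observed.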
3 – Socio-demographic and health variables
23The socio-demographic characterization was performed using conventional socio-demographic variables (shown in Table 1). The variables describing objective or subjective health status were: daily smoking, daily alcohol consumption, reporting at least one major drinking episode per month (six drinks or more on a single occasion), obesity (defined as a BMI ≥ 30), sedentarism (no physical activity in the past 12 months), having forgone medical treatment for financial reasons in the past 12 months, poor self-rated health, reporting a chronic illness or health condition, and mobility restrictions for at least six months due to a health problem (none, moderate, severe). Their association with socio-demographic characteristics is shown in Table 2.
III – Results
1 – Respondents were more willing to give their name and birth details than their social security number
24The question on giving their social security number was asked of 3,044 people, and the question on giving their name and birth details was put to 3,041 people. Overall, 34.9% (n = 1,114, sd = 1.06%) of the respondents said they would be willing to give their social security number, 64.4% said they would refuse (n = 1,912, sd = 1.07%) and 0.7% (n = 18, sd = 0.23%) did not wish to answer. The corresponding percentages for name and birth details were: 51.9% (n = 1,572, sd = 1.10%), 48.0% (n = 1,463, sd = 1.10%) and 0.16% (n = 6, sd = 0.07%). There was thus far more acceptance of the second request. The non-respondents were excluded from the subsequent analyses.
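The dispersion figures reported above can be put in perspective with a small calculation. Assuming that "sd" denotes the standard error of the weighted percentage (an interpretation, not something the text states), the sketch below compares it with the standard error that simple random sampling would give for the same percentage and sample size; the ratio of the two variances is a rough design effect.

```python
import math


def srs_standard_error(p, n):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


# Figures reported for the social security number request.
p_ssn, n_ssn, reported_se = 0.349, 3044, 0.0106

se_srs = srs_standard_error(p_ssn, n_ssn)
# Ratio of variances: how much the weighting appears to inflate the variance.
design_effect = (reported_se / se_srs) ** 2
print(round(se_srs, 4), round(design_effect, 2))
```

A design effect above 1 would be expected here, since unequal weights generally increase the variance of an estimate relative to an unweighted sample of the same size.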
Socio-demographic and health characteristics of respondents and percentage who agreed to provide social security number or birth details

| | Social security number: N | Social security number: percentage who agreed(a) | Name and date/place of birth: N | Name and date/place of birth: percentage who agreed(a) |
|---|---|---|---|---|
| Less than upper sec. (Ref.) | 597 | 30.6 | 583 | 55.3 |
| 2 years higher ed. | 400 | 41.2 | 375 | 51.9 |
| 3+ years higher ed. | 715 | 41.7 | 723 | 49.1 |
| Size (pop.) of locality of residence | | 0.261 | | 0.554 |
| Farmer, self-employed (Ref.) | 217 | 35.1 | 228 | 50.4 |
| 1st tercile (low) (Ref.) | 725 | 31.2 | 720 | 51.8 |
| 3rd tercile (high) | 1,076 | 43.9 | 1,143 | 54.1 |
| Don’t know/No answer | 177 | 7.3 | 180 | 26.1 |

(a) For each variable, the first line indicates the p-value associated with Pearson’s Chi2 test and the following lines indicate the percentages associated with the different modalities.
p-values below 0.05 are in bold.
25Table 1 shows that men were more likely than women to agree to either request, but that the acceptance rate did not vary significantly with age. There was, however, an education and socio-economic bias. Agreeing to provide one’s social security number was more common among the most educated respondents, those in higher-level or intermediate occupations and those in the most affluent households, and less common among manual workers and the few people whose occupational category could not be identified. Agreeing to provide one’s name and birth details tended to be more frequent among respondents with lower educational levels (even if non-significant), but remained more frequent among individuals with high incomes. Although the difference between occupational groups was small, manual workers were more willing than people in higher-level occupations to provide name and birth details (55.0% versus 53.1%).
Logistic model of agreement to provide social security number or birth details (percentage and adjusted odds ratios – aOR – with 95% confidence intervals – 95% CI); Models adjusted for occupational category/income/CU/call sample (landline or mobile)
(a) For each variable, the first line indicates the p-value associated with Pearson’s Chi2 test, and the following lines indicate the percentages associated with the various modalities.
(b) For each variable, the first line indicates the p-value associated with the test of overall significance of the variable in the model, the following lines indicate the adjusted ORs associated with the various modalities.
(c) See the description of the variables in the Methods section. The p-values below 0.05 are in bold.
26We did not find any significant association with the size of the locality of residence, occupational status or living alone for either question. Conversely, for both questions, people surveyed over a landline agreed to answer more frequently than those on a mobile.
27Table 2 shows that in a bivariate analysis, willingness to provide a social security number was significantly associated with reporting chronic health problems (42.0% versus 31.2%) or functional limitations (33.2% among individuals who did not report any restricted movement, 40.0% of the others). After controlling for the main structural effects, agreeing to provide one’s social security number is a primarily masculine behaviour, particularly common in the 15-24 age group. Willingness to provide this information seems to have very little connection to health behaviours: it was significantly more frequent among individuals reporting chronic health problems (OR = 1.52) and tended to be less frequent among sedentary individuals than the others (OR = 0.81).
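To relate the bivariate percentages to the odds ratios quoted above, a crude (unadjusted) odds ratio can be computed directly from two percentages. The sketch below does this for chronic health problems (42.0% versus 31.2%); the function names are illustrative. The result is not expected to equal the adjusted OR of 1.52, which comes from a logistic model controlling for other variables.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1.0 - p)


def crude_odds_ratio(p_group, p_reference):
    """Unadjusted odds ratio between two groups, from their proportions."""
    return odds(p_group) / odds(p_reference)


# Willingness to give a social security number: chronic problems vs none.
or_chronic = crude_odds_ratio(0.420, 0.312)
print(round(or_chronic, 2))
```

The crude value is slightly higher than the adjusted one, which is consistent with part of the bivariate association being attributable to the structural variables included in the model.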
28Willingness to provide name and birth details appears more linked to health characteristics (Table 2, bivariate analysis): we found more frequent acceptance among daily drinkers (59.5% versus 50.8%), individuals in poor self-rated health (65.9% versus 50.8%), who reported a chronic illness (57.5% versus 49.1%) or functional limitations (59.6% of those who said their movement was severely restricted and 53.0% of those who said their movement was restricted, versus 50.9% of the others), as well as an increased tendency among daily smokers (54.6% versus 50.8%). All other things being equal, Table 2 shows that the associations remain the same for men (OR = 1.39), the 15-24 age group (all the ORs in the higher age groups are significantly below 1), poor perceived health (OR = 1.58) and chronic illness (OR = 1.44), while the relationship to sedentariness becomes significant (OR = 1.53).
29For both analyses, we found that age, which was not significantly linked to agreement to either proposal in the bivariate analysis, did have a significant association in the multivariate model: consent to either request was more common in the 15-24 age group than in the rest of the population. Conversely, functional limitations were associated with agreement in the bivariate analysis, but no longer in the multivariate analysis. The associations with income persist, but the link with occupational category persists only for social security number, whereas employment status is never significant (data not shown).
30The breakdown shows that the individuals surveyed by mobile phone agree less often to provide their name and birth details than those surveyed over a landline telephone (OR = 0.55), but this is not the case for social security number (OR = 0.94).
IV – Discussion and conclusion
1 – Summary of results
31A higher percentage of people appear to be willing to provide their name and birth details than their social security number (51.9% versus 34.9%), men more frequently than women, with a social gradient that shows greater willingness among individuals with higher education, higher-level occupations (social security number) and higher incomes (social security number; name and birth details). The individuals surveyed via mobile phone also accepted the requests less often, while the reverse pattern was found for individuals who reported health problems, were sedentary, drank alcohol daily or were in poor self-rated health.
2 – Interpretation
32These results show that providing individual data that would enable access to medical and administrative data appears to be acceptable to a high percentage of respondents in a general population survey such as the Health Barometer, in particular for name and birth details. Agreement also follows a moderate social gradient, and is more common among people in poor health. The latter point can be interpreted in two ways. First, it suggests that the approach did not put off people with health problems, who might have feared an investigation of their healthcare spending or consumption, confirming the potential usefulness of the approach for surveys. Second, it can be interpreted through leverage-salience theory (Groves et al., 2000), whereby the decision to participate in a survey results from a trade-off between the effort required to participate and the respondent’s positive interest in the theme and in the potential benefits of the results. It is indeed likely that the people in poor health felt more concerned by the survey’s focus on health issues, and gained more satisfaction from answering questions that seemed more personally relevant and to which they could provide specific answers. That satisfaction could explain their more frequent acceptance of a request to provide their social security number or name and birth details. This greater willingness to take part, notable in all surveys when their theme and aims are explained to respondents, is confirmed here.
33Various hypotheses can also be put forward to explain the diverging results for the requests for social security number and those for name and birth details. The difference seems to indicate that respondents see an identity number as a more sensitive piece of information, more effective than birth details for identifying them and tracking down their personal data. This is a rational assessment: a social security number is indeed a highly identifying item of information, but we did not necessarily expect that reaction, given that writing one’s name on a document is generally perceived to be an official, binding action. However, the effort of memory required to provide one’s social security number in a telephone survey may have been the reason for some refusals. This outcome suggests that, in order to maximize the response rate, it might initially appear preferable to request name and birth details rather than social security number; however, the selection biases generated by that choice would be different, since people in poorer health would be more strongly selected if name and birth details were used. Such information might usefully guide the data collection protocol of future cross-sectional or cohort surveys. Lastly, the fact that the individuals surveyed via mobile phone were more reluctant than the others to consent to either request for identifying information could also be a consequence of the survey conditions: it is likely that some of those respondents were not at home or were with other people when they were surveyed, or at least in a less familiar setting than those surveyed on their home landline. They might therefore have been reluctant to provide personal information over the phone, where it might be overheard.
3 – Comparisons with other studies
34To our knowledge, no similar experiment has been conducted as part of a random telephone survey. Since the samples for the SPS and Constances surveys are drawn from health insurance records, the medical and administrative data are known for all the selected individuals, both respondents and non-respondents. That is the reverse situation to ours, and a much more advantageous one. Survey matching has been performed since 1970 (INSEE-CREDOC ten-year health survey), but is not a common practice. More recently, two-thirds of the respondents to the 2003 ten-year health survey, and 75% of respondents to the 2008-2009 disability and health survey agreed to provide their social security number. In those two cases, the acceptance rates were much higher, as were the participation rates. We should, however, point out that INSEE interviewers surveyed the respondents at home and the disability and health survey was compulsory. These contextual elements have a positive impact, as does public awareness of INSEE, which is much greater than that of smaller organizations like INPES. The authors of the matching experiment for the disability and health survey nevertheless emphasized major practical difficulties in identifying and matching social security numbers with medical and administrative databases (Montaut et al., 2013). We have no comparable example for the collection of name and birth details.
4 – Limitations
35Our study concerned a hypothetical situation; some of the people who consented in the experiment might refuse in a real-life situation. However, that effect could be offset by explaining the purpose of the survey and the conditions under which the data would be used, given that the rates of acceptance in the ten-year health survey and the disability and health survey were much higher than ours (Montaut et al., 2013). The acceptance rate observed for the social security number could therefore be considered as a minimum. Indeed, in our experiment, only a scientific study is mentioned, without any indication of purpose: no specific arguments are put forward. Moreover, we could not assess the quality of the social security numbers and names and birth details that people might have provided. Indeed, research on surveys that have collected social security numbers suggests that a non-negligible percentage of the numbers collected could not be matched, resulting in possible socio-demographic and health-related biases. While respondents are probably far more familiar with their names and birth details, which they more frequently provide on official forms than their social security number, they cannot always be used to retrieve the social security number, so this would limit the success of the collection operation.
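One inexpensive first check on the quality of collected social security numbers is the control key: the last two digits of the 15-character NIR equal 97 minus the remainder of the first 13 digits divided by 97, with the Corsican department codes 2A and 2B replaced by 19 and 18 for the calculation. The sketch below applies that rule; the function names are illustrative and the example numbers are artificial. Passing the check shows only that a number is well formed, not that it belongs to the respondent or can be matched.

```python
def nir_control_key(first_13):
    """Compute the two-digit control key from the first 13 characters of a NIR."""
    body = "".join(first_13.split()).upper()
    # Characters 6-7 hold the department of birth; Corsica uses 2A and 2B.
    department = body[5:7]
    if department == "2A":
        body = body[:5] + "19" + body[7:]
    elif department == "2B":
        body = body[:5] + "18" + body[7:]
    if len(body) != 13 or not body.isdigit():
        raise ValueError("expected 13 characters, digits only apart from 2A/2B")
    return 97 - int(body) % 97


def nir_is_well_formed(nir):
    """Check that a 15-character NIR carries the control key implied by its body."""
    compact = "".join(nir.split())
    if len(compact) != 15 or not compact[13:].isdigit():
        return False
    try:
        return nir_control_key(compact[:13]) == int(compact[13:])
    except ValueError:
        return False


# Artificial numbers chosen so that the arithmetic is easy to follow.
print(nir_control_key("0000000000001"), nir_is_well_formed("000000000000196"))
```

Such a check could flag mistyped or misremembered numbers at the moment of collection, which is when a telephone interviewer can still ask the respondent to repeat them.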
36It would have been useful to ask the consenters’ and refusers’ opinions on the collection method or the type of identifying data requested (social security number or name and birth details). Each respondent was asked only to provide one type of data, but some refusers might have agreed to provide the other type. Unfortunately, as the length of the questionnaire was restricted, it was not possible to add more questions.
37Moreover, no official requests for authorization were made, and major restrictions and limitations might apply to the implementation of the procedure. However, the examples of the disability and health survey and the ten-year health survey show that various difficulties can be overcome by means of technical solutions, notably the instant encryption of social security numbers and of name and birth details to ensure data anonymity.
38Our study covered one stage in a very long process. A dedicated experimental study would be needed to assess the final quality of the operation.
5 – Consequences for future health barometers
39Matching with health data could therefore be considered for future health barometers. This would make it possible to monitor the sub-sample of volunteers longitudinally and to expand conventional analyses with medical and administrative variables. The tested procedure involved a request at the end of the questionnaire for consent to match the respondents’ data: while it is tempting to restrict the sample to volunteers for obvious economic reasons, such a protocol would require identifying the consenters at the beginning of the questionnaire, so another experiment would be needed. Moreover, if consent were requested earlier in the questionnaire, it might increase the refusal rate, or cause respondents to drop out because they do not understand the purpose or value of the survey.
40The measured acceptance rates are still low and would need to be increased to take full advantage of matching opportunities: the protocol requires further discussion. For example, an explanation could be provided to those who hesitate and, given the major differential in willingness to provide social security number versus name and birth details, social security number could be requested first and, in the event of refusal, name and birth details, in order to maximize the success of the subsequent matching process. To make the sample of volunteers more representative, efforts could also be made to convince the most reluctant people, i.e. in our study, those who are the least socially and economically advantaged or who refuse to disclose their household income. Another, theoretically more effective, way forward could be to draw the sample from the state health insurance databases, as IRDES did for the SPS survey. However, that would involve a complete reworking of the current survey protocol.
Glossary of acronyms
41ATIH: Agence technique de l’information sur l’hospitalisation (Technical agency for information on hospitalization)
42CépiDc: Centre d’épidémiologie sur les causes médicales de décès (Centre for the epidemiology of medical causes of death)
43CNAMTS: Caisse nationale de l’assurance maladie des travailleurs salariés (Health insurance scheme for salaried workers)
44CNAV: Caisse nationale d’assurance vieillesse (National old-age insurance fund)
45CNIL: Commission nationale de l’informatique et des libertés (French data protection authority)
46EPRUS: Établissement de préparation et de réponse aux urgences sanitaires (Health emergency preparation and response unit)
47INDS: Institut national des données de santé (National institute of health data)
48INPES: Institut national de prévention et d’éducation pour la santé (National institute for health prevention and education)
49INSERM: Institut national de la santé et de la recherche médicale (National institute for health and medical research)
50InVS: Institut de veille sanitaire (French institute for public health surveillance)
51RNIAM: Répertoire national inter-régimes de l’assurance maladie (National inter-regimes health insurance register)
52RNIPP: Répertoire national d’identification des personnes physiques (National register for the identification of physical persons)
53SNIIRAM: Système national d’information inter-régimes de l’Assurance maladie (Health insurance inter-regime information system)
Institut national de la statistique et des études économiques (INSEE), Paris (France); CESP, Faculté de médecine, Université Paris Sud, Faculté de médecine UVSQ, INSERM, Université Paris-Saclay, Villejuif, France.
Correspondence: Stéphane Legleye, INSEE, 18 bd Adolphe Pinard, 75014 Paris, France, email: firstname.lastname@example.org
Santé publique France; CESP, Faculté de médecine, Université Paris Sud, Faculté de médecine UVSQ, INSERM, Université Paris-Saclay, Villejuif, France
INSERM, CépiDc, Le Kremlin-Bicêtre, France
Santé publique France was formed in 2016 from the amalgamation of InVS, INPES, EPRUS and ADALIS. The Baromètre santé surveys were previously managed by INPES.
Since 1988, the health and social protection survey (Enquête santé et protection sociale) has gathered data on health status, health insurance, social status and use of healthcare from a sample of 8,000 ordinary households, consisting of 22,000 individuals. The survey is conducted every two years and the same households are interviewed over four years (www.irdes.fr).
Tempo is a cohort study of the children of the participants in the Gazel epidemiological cohort, set up by INSERM in 1989 in cooperation with the electricity and gas utility EDF-GDF. Tempo seeks to improve understanding of the various health needs of young adults in France (http://www.tempo.inserm.fr/).
See glossary in Appendix for a list of the acronyms used in this article.
Constances is a “generalist” epidemiological cohort, consisting of an initial sample of 200,000 adults aged 18-69 who have attended a social security health screening centre (www.constances.fr).