Quantitative surveys employ an ever wider range of data collection methods. The development of telephone surveys in the 1990s offered a cheaper alternative to face-to-face interviews (Riandey and Firdion, 1993). Later, as the number of cell phone owners increased, technical advances made it possible to conduct detailed interviews and to directly select random samples of adults by means of cell phone numbers. More recently, the spread of the Internet has made it possible to conduct surveys with minimal data collection costs on “spontaneous” samples of volunteers or on representative samples of respondents recruited by phone. These innovations raise questions of sample representativeness and of the effect of data collection mode on the responses obtained. Analysing the Dutch version of an international survey on tobacco control, Mary Thompson and her colleagues compare the results obtained on two subsamples, one that responded via the web, and the other by phone. They highlight differences linked to the respondents’ characteristics and others attributable to the data collection mode, and present a method for estimating the results that would have been obtained using a single collection method, i.e. that used in the other countries.
As obtaining a probability sample from a survey population becomes more and more difficult, survey practitioners are turning increasingly to mixed mode survey methods (Blyth, 2008). Telephone surveying is ideal for many purposes, because the questionnaire is administered by a person trained to elicit information and to keep the respondent engaged, and because travel costs are eliminated (Roberts, 2007). Random digit dial (RDD) methods (Groves et al., 1988) were developed to overcome the only serious issue with the frame in high-penetration countries, namely the fact that many numbers are not listed. However, in recent years telephone frames have become less useful, partly because of the proportional increase in cell-phone-only households (Blumberg et al., 2006; Blumberg and Luke, 2008), and to an even greater extent because of access control technologies such as call display and automatic screening (Roberts, 2007; Tuckel and O’Neill, 2001).
One approach to mitigating the problem is to use additional frames and data collection modes in order to increase population coverage and response rates. Self-administered web data collection is particularly attractive because there are no interviewer or data entry costs (Blyth, 2008; Roberts, 2007). Thus, there is a great deal of interest in developing web survey frames and methods to encourage timely and good quality responses to surveys hosted on the web. Early on, web survey frames were mainly lists of email addresses gathered for other purposes. However, many survey firms (e.g. Harris Interactive and TNS NIPO) are developing databases and panels of people willing to respond to surveys on the web for appropriate compensation. The initial recruitment into the panel is often conducted by telephone or email, and the databases or panels can be described as “rich” frames because the respondent’s personal data can be collected at recruitment. The combination of data collection modes presents new data quality and analytic challenges (Frippiat and Marquis, 2010).
An immediate problem with combining telephone and web survey results, either across or within surveys, comes from the differences in the way questions are processed cognitively by respondents. For sensitive or difficult questions, or questions for which socially desirable responses exist, the presence or absence of a human interviewer often makes a difference in how a respondent answers. For questions with a large number of response options, it is easier to choose with accuracy from a list which is seen than from a list which is heard. Thus, it is commonly observed that telephone respondents tend to give more socially desirable responses (Moskowitz, 2004; Kreuter et al., 2008), and more recently heard responses (Bishop et al., 1988). They are also more likely to choose the extreme ends of a 5-point Likert scale (Dillman et al., 2009). These kinds of phenomena can be called mode effects of the administrative sort, or administration effects. Bowling (2005) has reviewed the effects on data quality of the mode of questionnaire administration.
Other types of mode effects found with mixed mode surveying can be called selection effects. These arise when the sample of respondents in one mode cannot be considered as a random subsample of the whole sample, as far as their characteristics are concerned. In some mixed mode surveys, respondents are recruited from a single frame, and either assigned to a mode or allowed to choose between modes (De Leeuw, 2005). If respondents are allowed to choose, the subsamples for the two modes may ultimately differ on key characteristics. The other main kind of mixed mode design is a dual frame design, where respondents to the two modes are recruited from different but overlapping frames. Coverage by the frames, as well as nonresponse biases, may be expected to differ for the two modes (Nagelhout et al., 2010), again leading to different response distributions for the subsamples.
In simple comparisons of results from the two mode subsamples, administration and selection effects will be confounded. For example, if a higher proportion of a web subsample were to admit to a certain risky behaviour, we might suspect that this involves an administration effect in the sense that it is easier to admit such a thing when no interviewer is present; however, there might well be alternative explanations in terms of selection effects, associated with differences in the distributions of age or areas of residence in the subsamples.
In more complex comparisons, where variables associated with selection effects are controlled, it is more plausible to assume that remaining differences are administration effects. It may then be possible to model the administration effects when the telephone subsample and the web subsample are combined. This article illustrates an approach to such modelling that can be carried out with standard statistical software. Since the aim is to account for mode effects in the analysis of data from mixed mode surveys, it is not necessary to model the various sources of administration effects separately.
A useful concept for quantifying selection effects is the “propensity” to respond by one mode or the other (Rosenbaum and Rubin, 1983). Theoretically, this is the probability of responding by (say) telephone, given the fact of being in the combined sample of respondents, as a function of demographic variables X, and additional characteristics W which might influence the mode of response. In some applications, the propensity might be interpreted as the probability that the respondent chooses to respond by telephone, given a choice; in our application, it is, more simply, the probability of having responded by telephone, conditional on having been contacted through one method or the other and having responded. It can be shown that, given a particular value of the propensity strictly between 0 and 1, the telephone and web parts of the sample are balanced with respect to the distribution of X, W. We cannot know the true propensity function, but we can approximate and estimate the propensity using a logistic regression model, regressing an indicator for responding by telephone on the covariates X, W. The resulting propensity score formula quantifies the selection effects which depend on these variables. [1]
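The estimation step described here can be sketched in a few lines of code. The sketch below simulates a hypothetical sample in which older people are more likely to respond by telephone, fits the logistic regression by Newton–Raphson, and computes the logit of the estimated propensity (used later as the control variable Z); the data, the single covariate and all parameter values are invented for illustration, and real propensity models would include the full set of X and W covariates.

```python
import math
import random

def fit_logistic(x, y, iters=50):
    """Fit P(y = 1 | x) = expit(a + b*x) by Newton-Raphson (toy one-covariate fit)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        # Gradient and Hessian of the log-likelihood
        ga = gb = haa = hab = hbb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            ga += yi - p
            gb += (yi - p) * xi
            w = p * (1.0 - p)
            haa += w
            hab += w * xi
            hbb += w * xi * xi
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (-hab * ga + haa * gb) / det
    return a, b

random.seed(1)
# Hypothetical sample: older respondents more likely to answer by telephone
age = [random.uniform(18, 75) for _ in range(2000)]
tel = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-4.0 + 0.06 * a))) else 0
       for a in age]

a_hat, b_hat = fit_logistic(age, tel)
# Z: the logit of the estimated propensity, carried into the response model
Z = [a_hat + b_hat * x for x in age]
```

In practice this fit would be done with standard software (the authors use SAS PROC LOGISTIC); the point of the sketch is only that Z is a fitted linear predictor, not an observed quantity.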
Controlling for propensity score in comparing the results from the two sample parts allows us in principle to separate administration mode effects from selection effects (i.e. from differential coverage and nonresponse bias). That is, if we compare telephone and web respondents with the same propensity score, the average mode differences will not be confounded with the variables X and W, and are therefore more likely to be administration effects.
When no assumptions are made concerning the directions of selection and administration effects, these effects are confounded, as pointed out by Vannieuwenhuyze et al. (2010), who advocate disentangling them by comparing a mixed mode data set with a corresponding single mode data set. In this article, we take the selection effect to be aligned with the propensity score, and the administration effect as the mode effect controlling for the propensity score. With these definitions, the two effects can both be estimated.
The responses may depend on covariates, possibly including those in X and W, other than through the propensity score. In that case, if we add covariates to the model, transformed to be orthogonal to the propensity score, we should still be able to interpret the coefficient of the propensity score as a selection effect. Where we omit such variables, as in the examples in Section III, the estimated administration effect can be thought of as estimating the difference in response distribution – answering by telephone and web – for populations with the same joint distributions of X and W, as determined by conditioning on propensity score values.
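The orthogonalization mentioned here can be illustrated with a minimal sketch: residualizing a single covariate on the estimated propensity logit by least squares, so that the covariate's coefficient no longer absorbs any of the selection effect. All variables and values below are hypothetical.

```python
import random

def residualize(w, z):
    """Return w minus its least-squares projection on z (and a constant),
    so the result has zero sample correlation with z."""
    n = len(w)
    mw = sum(w) / n
    mz = sum(z) / n
    cov = sum((wi - mw) * (zi - mz) for wi, zi in zip(w, z)) / n
    var = sum((zi - mz) ** 2 for zi in z) / n
    beta = cov / var
    return [wi - mw - beta * (zi - mz) for wi, zi in zip(w, z)]

random.seed(2)
z = [random.gauss(0, 1) for _ in range(500)]        # propensity logits (hypothetical)
w = [0.8 * zi + random.gauss(0, 1) for zi in z]     # covariate correlated with z
w_perp = residualize(w, z)                          # enters the model alongside z
```

Adding w_perp rather than w to the response model leaves the propensity coefficient's interpretation as a selection effect intact.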
In the final illustrative analyses presented in Section IV, we use the estimated propensity score not only to measure the selection effects, but also to control for them. If the set of variables in X and W seems likely to account for the selection effects and is not too extensive, simply controlling for X and W rather than for the propensity score will suffice.
When modelling responses that contain mode effects, the approach will depend on the purpose at hand. However, in some cases it seems reasonable to take one mode to be the standard, with the effect of the other mode being characterized in terms of parameters in the model. For a specific outcome, the dependence on mode can then be expressed as (i) a dependence on X and W or the propensity score, quantifying the selection effects, and (ii) a transformation relating the patterns of responses in the standard mode to the patterns of responses in the other mode, quantifying the administration effects. In our illustration, using data from the International Tobacco Control Survey in the Netherlands (the ITC Netherlands Survey), the choice of the telephone mode as standard is arbitrary; it is not our intention to advocate a preference for one mode over the other, but to show how the results from both modes can be combined in a mixed mode design.
This article aims to suggest a way of modelling simultaneously the administration and the selection effects for outcomes measured on a five-point ordinal scale, and a way of incorporating these effects into analyses in a multi-method study.
This approach to modelling can be used in several different ways: (i) to test for administration and selection effects in the response patterns for individual questions and to estimate their magnitude (as illustrated in this section); (ii) to test for common administration and selection effects in groups of questions; (iii) to use the administration effect parameters to “predict” the distribution of a respondent’s response by telephone, given her/his response by web; and (iv) to account for collection mode effects when combining the web and telephone samples for analysis. As indicated, this article focuses on the first and last of these uses. In Section III, we show the results obtained when administration and selection effects are estimated by modelling for some questions in the ITC Netherlands Survey, and in Section IV, we give an example of a combined sample analysis.
This article is organized as follows. Section I introduces the data used for the model. In Section II the model is described in detail. In Section III the model is applied to selected questions from the ITC Netherlands Survey. Section IV presents the results of embedding the model in a cross-country comparison, and Section V is devoted to discussion.
I – The data
The International Tobacco Control Policy Evaluation Project (ITC) conducts longitudinal surveys, mainly of adult smokers, in 20 countries in order to evaluate policy measures being implemented under the World Health Organization Framework Convention on Tobacco Control (FCTC). In most of the countries, data collection is carried out either by telephone or face-to-face. However, mixed mode surveying has begun to enter the ITC Project. A description of the conceptual model of the ITC Project and the methods used in the earliest ITC surveys can be found in Fong et al. (2006) and Thompson et al. (2006).
The ITC Netherlands Survey, a survey of adult smokers with an oversampling of younger adults, differs from the other ITC surveys (prior to 2008) because most of the participants are responding to the CAWI (Computer-Assisted Web Interviewing) form of the questionnaire. TNS NIPO, the firm carrying out fieldwork in the Netherlands, has recruited an access panel (essentially a rich frame) of over 140,000 people from the general population for web surveys. The access panel is a non-probability panel recruited by TNS NIPO by phone or mail, but not by Internet. Since it is not possible to apply for participation, the panel has a relatively low number of “professional respondents”, who participate in many web surveys as a way to generate income (Willems et al., 2006). Those invited to participate in the ITC Netherlands Survey constituted a stratified random sample from the panel. Web surveying has become the preferred survey method in the Netherlands, as telephone surveying is not generally seen as cost-effective in that country and almost the entire population has Internet access (European Commission, 2008). It is a limitation of our analysis that the access panel is a non-probability panel, and we try to compensate for this fact in part by modelling the selection process.
In Wave 1, carried out between 13 March and 25 April 2008, the target was to recruit 1,700 CAWI participants aged 15 years or older, and more than 1,800 were obtained. The cooperation rate (the proportion of those invited and eligible who did respond, i.e. who answered the interview questions) was 78.0%. There was also an RDD (random digit dial) component of about 400 respondents aged 18 years or older, included for purposes of assessing the mode effects and facilitating comparison with the ITC surveys in France, Germany and the United Kingdom, which are conducted entirely by telephone. For the RDD component, the cooperation rate was 78.1%. The response rate (the number of respondents as a proportion of the estimated number of telephone numbers attempted which could have reached eligible individuals) was only 4.2%. This is not unusual in the Netherlands, where response rates to telephone surveys have been declining since the 1990s (Bronner and Kuijlen, 2007). It should be noted that a low response rate does not necessarily translate into large nonresponse biases. Nagelhout et al. (2010) compared the demographic composition of the CATI (Computer Assisted Telephone Interviewing) sample with that of the population as determined by Statistics Netherlands (CBS), and found it to be reasonably representative.
The response distributions for many of the questions are different for CATI and CAWI administration, as shown by the Appendix Tables of this article and by some formal analyses in Nagelhout et al. (2008).
II – The model
The questions from the ITC Netherlands Survey chosen for analysis in Section III have five response options. We thus describe a model here for ordinal responses with five options. The basic model is the following, where Y denotes the coded response:

P(Y ≤ d) = exp(η_{d}) / [1 + exp(η_{d})]   [1]

where d goes from 1 to 4, and

η_{d} = c_{d} − (γ + (d − 1)δ) × mode − B_{1}Z + b_{0}u   [2]

Notice that the probability of the highest response, 5, is one minus the fourth probability given here.
To understand this ordinal logistic regression model [1], we can picture an imaginary underlying continuous response Y* for which the range is divided into five parts by response option thresholds τ_{1} ≤ τ_{2} ≤ τ_{3} ≤ τ_{4}. When Y* belongs to the dth part of the range, the observed response Y is equal to d. For example, if Y* is between τ_{3} and τ_{4}, then Y is equal to 4, while if Y* is above τ_{4}, then Y is equal to 5. The logit of the probability that Y ≤ d has a linear form in which the fixed explanatory variables are denoted by mode and Z, shown in [2]. The intercept parameters c_{1} ≤ c_{2} ≤ c_{3} ≤ c_{4} can be thought of as location parameters for the response option thresholds. That is, a shift in c_{d} implies a corresponding shift in the expectation of the threshold τ_{d}. In this model, the parameters γ and δ are the administration mode effect parameters. If the variable mode takes values 0 and 1, for web and telephone respectively, γ represents the amount by which the telephone mode translates the locations, and δ represents an amount by which the locations may be spread apart or contracted by the telephone mode. If there is a tendency for more extreme responses with telephone (mode = 1), then we would expect γ to be negative (increasing τ_{1} and the probability for response 1) and δ to be positive, with 3δ > −γ, i.e. γ + 3δ > 0 (decreasing τ_{4} and increasing the probability for response 5). If there is a tendency to select more recent responses with telephone, we would expect γ to be nonsignificant or positive, and δ again to be positive, leading to a decrease in all of the τ_{d}. The (γ, δ) parameterization is intended as a parsimonious expression for the combined administration effects of mode.
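A small numerical sketch may help fix ideas. Assuming one parameterization consistent with this description – logit P(Y ≤ d) = c_{d} − (γ + (d − 1)δ) × mode, with the propensity and random-effect terms omitted – the code below computes the five category probabilities under each mode; the threshold intercepts and the (γ, δ) values are invented for illustration.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(c, gamma, delta, mode):
    """Category probabilities for a 5-option ordinal response.
    Assumed parameterization (one form consistent with the text):
      logit P(Y <= d) = c[d-1] - (gamma + (d-1)*delta) * mode,  d = 1..4,
    with P(Y = 5) = 1 - P(Y <= 4).  Propensity and random-effect
    terms are omitted for clarity."""
    cum = [expit(c[d] - (gamma + d * delta) * mode) for d in range(4)]
    return ([cum[0]]
            + [cum[d] - cum[d - 1] for d in range(1, 4)]
            + [1.0 - cum[3]])

c = [-1.5, -0.5, 0.5, 1.5]   # illustrative threshold intercepts
web = category_probs(c, gamma=-0.4, delta=0.3, mode=0)
tel = category_probs(c, gamma=-0.4, delta=0.3, mode=1)
# With gamma < 0 and gamma + 3*delta > 0, the telephone mode inflates
# both extreme options (responses 1 and 5) relative to the web mode.
```

Running the sketch shows the telephone distribution placing more mass on responses 1 and 5 than the web distribution, which is exactly the "more extreme responses" pattern described above.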
The variable Z is the logit of the individual’s estimated propensity to respond by telephone (mode 1) in terms of the covariates of interest X (such as sex, age group and education) and additional variables W; it is obtained from a separate logistic regression. Thus B_{1}, the coefficient of Z in the ordinal response model, is taken to be the selection effect parameter. Depending on the context, other variables explanatory of the response itself could be added, as in the example in Section IV. If they were not orthogonal to Z, the coefficient of Z would no longer be taken to measure the selection effect. However, within the model, the parameters γ and δ are still interpreted as administration effect parameters.
The variable u is a latent variable or random effect, which is assumed to be N(0,1), independently for each individual, and b_{0} is a positive multiplier. This variable is included to account for individual variability, and to allow the model to be fitted using SAS PROC NLMIXED, where the presence of a random effect is required for convergence.
Useful references for ordinal response models like the one proposed here include McCullagh and Nelder (1989), and Grilli and Pratesi (2004).
III – Results for a selection of questions in the ITC Netherlands Survey
Before modelling the mode effects in the questions from the ITC Netherlands Survey, the first step in the method was to model the propensity to respond by telephone, using SAS PROC LOGISTIC. Web respondents under 18 years of age were removed from this illustrative analysis, since their telephone propensity would be 0. The variables X were taken to be sex, age group, and education, since these are demographic controls used in most ITC Project analyses. The additional characteristics W were marital status and some individual attitude variables – possible “webographic” variables in the terminology of Schonlau et al. (2007) – for which the response distributions had been found in a preliminary analysis to vary significantly by mode. These were:
– time perspective, measured by the statement “You spend a lot of time thinking about how what you do today will affect your life in the future”, with five response options “Strongly agree”, “Agree”, “Neither agree nor disagree”, “Disagree”, and “Strongly disagree”, coded as 1 through 5;
– personal stress #1, measured by “In the last six months, how often have you felt that difficulties were piling up so high that you could not overcome them?”;
– world event stress, measured by “In the last six months, how often have you been distressed by world events?”;
– personal stress #2, measured by “In the last six months, how often have you felt that you were unable to control the important things in your life?”.
When choosing the variables to include in the propensity formula, the aim is to produce a good predictor of response by telephone rather than to produce an explanation of it. If the propensity is well described by its model, controlling for the estimated propensity in the final term in the model can account for the sampling or selection effect of mode. Readers are referred to Rosenbaum and Rubin (1984) and Riou Franca et al. (2009) for further details of propensity score modelling.
The fitted propensity model, which was the basis for estimating individual propensity scores, is given in Table 1.
Table 1. Model for propensity for telephone response^{(a), (b), (c)}
(a) Wald test with 3 degrees of freedom. (b) Wald test with 2 degrees of freedom.
(c) See text for details.
Clearly the propensity to respond by telephone was much lower for those in the younger age groups, and although this finding partly reflects the fact that younger people are less likely to use landlines, it is largely due to the deliberate oversampling of younger smokers from the web database. The propensity to respond by telephone was higher for those in the upper two education levels. Controlling for sex, age group, and education, the propensity to respond by telephone was lower for those scoring higher on “lack of time perspective” and the “personal stress” variables, and higher for those scoring higher on the “world event stress” variable. Translated into terms of selection effects, the results suggest that, relative to the telephone sample, the web sample oversampled the younger age groups (as it did by design) and those with higher levels of lack of time perspective and of the personal stress variables. Also, the web sample undersampled the upper two education levels and those with higher levels of the world event stress variable. [2]
As indicated earlier, the questions chosen for analysis from the ITC Netherlands Survey have five response options, such as “Never”, “Rarely”, “Sometimes”, “Often” and “Very often”. The options are coded as 1 through 5. “Refused” and “Don’t know” were recoded as missing.
The frequency tables of responses in the Appendix suggest that telephone respondents are more likely to select an extreme response; a more formal analysis can be found in Nagelhout et al. (2008).
Tables 2 to 4 show the estimates for the fitted ordinal response model for some individual questions. The questions belong to three groups of Likert items (questions with five ordinal response categories), grouped by subject matter and response options. The purpose of considering several questions per group is to examine whether the administration effects are consistent within a group. A summary of the results is that the γ, δ and γ + 3δ values are mainly consistent within groups, reflecting the patterns observed in the frequency tables. The Akaike Information Criterion (not shown), a commonly used measure of the relative goodness of fit, is improved in each case by the addition of γ and δ. The random effect scaling factor b_{0} is not significantly different from 0 in any case. A propensity effect is present for many of the questions.
The first set, in Table 2, consists of questions that ask the respondent how often in the previous month, if at all, certain thoughts occurred.
Table 2. Model results for questions with five response options for frequency^{(a)}
(a) Response options: 1 = Never, 2 = Rarely, 3 = Sometimes, 4 = Often, and 5 = Very often.
Note: γ < 0 and γ + 3δ > 0 express a greater tendency of telephone respondents than web respondents to give a low answer (1) or a high answer (5), respectively.
Significance levels: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001.
The values of the administration mode effect parameters γ, δ and γ + 3δ are significant, similar across the set, and consistent with the possibility that telephone respondents give more extreme responses. The dependence on propensity to respond by telephone, as expressed by B_{1}, is statistically significant for the first three, and all estimates are positive, suggesting responses closer to the upper end for those more likely to respond by telephone (since the probabilities of the lower categories decrease as propensity rises); B_{1} is not statistically significant in the last column. Other questions in the survey with the same set of response options have been analysed and shown to give similar though less strongly significant estimates for the γ and δ parameters; these results are not included here, to save space.
Table 3 shows results for questions in which respondents are asked to indicate the extent to which they agree with certain statements.
Table 3. Model results for questions with five response options for degree of agreement^{(a)}
(a) Response options: 1 = Strongly agree, 2 = Agree, 3 = Neither agree nor disagree, 4 = Disagree, and 5 = Strongly disagree.
Significance levels: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001.
The first variable in Table 3 is atypical, in that γ is nonsignificant and barely negative, while δ and γ + 3δ are significantly positive, reflecting a greater tendency to choose the most recently heard responses by telephone. The strong double negative in “disagree” and “disapproves” may be a factor in producing a different pattern. The second and third variables in Table 3 show results like those for Table 2, in line with a tendency for telephone respondents to select more extreme responses, but with γ + 3δ not significant for the second variable. For the first and second variables in Table 3 we see significant negative values for B_{1}. Since this increases the probabilities of the lower-end options, it appears that those with higher telephone propensity are less likely to choose the upper-end options (disagreement) with statements affirming disapproval of smoking, and perhaps are more sensitive to social norms. The results for γ and B_{1} suggest that, for the first question and to some extent the second, the sampling bias and the administration effects are pulling in opposite directions.
The first variable in Table 4 shows a pattern for γ and δ similar to those of Table 2, and B_{1} is not significant. The results are again consistent with a tendency of telephone respondents to give more extreme responses. The second variable appears to show a tendency for telephone respondents to give closer-to-lower-end responses (due to an administration mode effect), and a nonsignificant tendency for those more likely (in terms of their characteristics) to respond by telephone to favour the response closest to the upper end. Interestingly, very similar proportions are obtained for the middle option [3] (33.95% for web and 34.90% for telephone), as well as a slightly greater use of the extreme options by telephone respondents. For the third variable in Table 4, the significantly positive δ and γ + 3δ show telephone respondents more likely to choose the final, “extremely sure” option; the dependence on propensity is not significant.
Table 4. Model results for questions with five response options^{(a)}
(a) Response options: 1 = Not at all, 2 = Slightly, 3 = Moderately, 4 = Very much, and 5 = Extremely.
Significance levels: *p-value < 0.05, **p-value < 0.01, ***p-value < 0.001.
Thus we see that the administration effects as we have expressed them are significant and fairly strong in all groups. They appear to be more similar within question groups where not only the set of response options but also the questions themselves are alike. For some questions a selection effect as well as an administration effect can be seen. It should be noted that we have expressed the administration effects only in terms of favouring responses close to one or both ends of the scale, and have not emphasized the possibility that they may include social desirability effects. Such effects may well be present, with size and direction depending on the nature of the question and the response options.
IV – Results for a label salience variable across four European countries
One interesting comparison across the ITC countries in Europe concerns the reactions of smokers to enhanced text warnings on cigarette pack labels introduced in the EU in 2003. The study of ITC results presented by Hitchman et al. (2011) suggests not only country-to-country differences in key variables, but also some difference in response patterns between the phone and web samples in the Netherlands. The other countries in the comparison are France, Germany and the UK, all with telephone as the mode of administration.
The label noticing variable: “In the last month, how often, if at all, have you noticed the warning labels on cigarette packages?” has five response options: “Never”, “Rarely”, “Sometimes”, “Often” and “Very often”. Therefore, we applied the model of Section II, adding a country term, and including the propensity term only in the case of the Netherlands. The model is thus given by

P(Y ≤ d) = exp(η_{d}) / [1 + exp(η_{d})]   [3]

where d goes from 1 to 4, and

η_{d} = c_{d} − (γ + (d − 1)δ) × mode − θ′C − β f(X) − B_{1} I_{Neth} Z + b_{0}u   [4]

For this model we again set the variable mode equal to 1 for telephone, and 0 for the web.
In formula (4), C is the set of country indicators, θ is the vector of coefficients for the country indicators, I_{Neth} is an indicator for the Netherlands, and f(X) represents a one-dimensional summary of the demographic variables which is a good predictor for noticing labels (reduced to accommodate a limitation of PROC NLMIXED). The demographic variables combined in the predictor are sex, age group, ethnicity (country of birth for France), education, cigarettes per day and time to first cigarette. The variable Z stands for the logit of the propensity for web in the Netherlands. Cross-sectional survey weights, scaled to sum to country sample size, have been used in this analysis. The sample sizes are 1,383 for web and 347 for telephone in the Netherlands, and 1,559 in France, 1,361 in Germany and 1,412 in the UK. The results are given in Table 5. The reference level for country is Germany.
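The role of the country term can be sketched numerically. Assuming, as before, a parameterization in which a common offset is subtracted from every threshold logit, the code below shows how per-country offsets shift the whole response distribution; the threshold intercepts and all offset values are hypothetical, not estimates from Table 5.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(c, offset):
    """Five-category probabilities when every threshold logit is shifted by
    a common offset (e.g. a country effect).  Assumed parameterization:
    logit P(Y <= d) = c[d-1] - offset, for d = 1..4."""
    cum = [expit(cd - offset) for cd in c]
    return ([cum[0]]
            + [cum[d] - cum[d - 1] for d in range(1, 4)]
            + [1.0 - cum[3]])

c = [-1.5, -0.5, 0.5, 1.5]   # illustrative threshold intercepts
# Hypothetical country offsets, reference level Germany = 0
theta = {"Germany": 0.0, "UK": 0.9, "France": 0.8, "Netherlands": 0.1}
dist = {country: category_probs(c, off) for country, off in theta.items()}
# A larger country offset pushes probability mass toward the upper
# ("more frequent noticing") categories.
```

In the full model [4], the mode, propensity and f(X) terms would be added to the same linear predictor; the country term simply contributes one such offset per country.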
Table 5 shows γ negative and significant, and δ positive and significant, consistent with a tendency of telephone respondents to show more extreme responses. A significant and positive B_{1} suggests that the higher the propensity to respond by telephone in the Netherlands, the higher the label salience, controlling for the covariates. The individual random effect coefficient b_{0} is significant for this weighted analysis. The country effects follow the same pattern as seen in the analysis of Hitchman et al. (2011), where countries fell into two groups. Label salience was greatest, and about the same, for the United Kingdom and France. It was significantly lower in Germany and the Netherlands, but the difference between the Netherlands and Germany was barely significant at the 5% level.
Table 5. Parameter estimates for a multi-country application of the mixed mode model to the question: “In the last month, how often, if at all, have you noticed the warning labels on cigarette packages?”^{(a)}
(a) Response options: 1 = Never, 2 = Rarely, 3 = Sometimes, 4 = Often, and 5 = Very often.

The variable f(X) in the Netherlands is not in fact quite orthogonal to Z in this example; the correlation between the two variables is 0.165 in the Netherlands. Therefore, the estimate of B_{1} is not interpretable purely as a selection effect. When the model is fitted with no f(X) term, the parameter point estimates change very little, but their standard errors increase; the estimate of B_{1} is reduced in magnitude to 0.4619 and its p-value becomes 0.0539; the p-value for Netherlands versus Germany becomes less significant at 0.1571. Thus the analysis with no f(X) term is conservative, and the inclusion of f(X) in the model means that the parameters are estimated with greater precision.
52If we treat Netherlands telephone and Netherlands web as two separate countries, the model contains no terms for the administration or selection mode effects, but takes the form:
54The results of model [5] are shown in Table 6.
Table 6. Parameter estimates for a multi-country application of the model without mode terms to the question: “In the last month, how often, if at all, have you noticed the warning labels on cigarette packages?”^{(a)}
(a) Response options: 1 = Never, 2 = Rarely, 3 = Sometimes, 4 = Often, and 5 = Very often.
55Netherlands telephone is not significantly different from Germany, while Netherlands web is significantly different. The model of Table 5 expresses the same difference in terms of mode parameters, and allows the data from the two samples to be combined while accounting for the mode effects. [4]
V – Discussion: summary and limitations
56The results illustrate that the modelling approach presented in this paper can describe observed mode effects that appear to be administration effects, at least for five-scale questions, and can distinguish administration effects from selection effects associated with collection mode. The model does not distinguish between different types of selection effects, such as differential coverage bias and nonresponse bias; nor can it, without a richer collection of items, separate certain administration effects, such as recency (the tendency to select the most recently heard response option), from social desirability bias when the most recently heard response option has greater social desirability. For five-scale questions, if the selection effects as defined in the models of this paper are not of interest, the administration effects can still be modelled as in the paper, with the control variables including those which would have been used in a propensity score model.
57At the expense of making stronger assumptions, our method has an advantage over the one in Vannieuwenhuyze et al. (2010), in that no auxiliary single-mode data set is needed.
58To use the propensity score as a summary of variables on which differential sampling bias depends, it is necessary to have, at least conceptually, a large overlap in the coverage of the two frames. A different approach would have to be adopted in a situation where, for example, adults were surveyed by telephone and young people by web. For the ITC Netherlands Survey, the coverage of the telephone and web frames can be said to have a large overlap, although it is by no means complete: there are portions of the population without landlines or without Internet access, and hence without coverage by one frame or the other.
59In the application to sets of questions in the ITC Netherlands Survey, we see that the administration effects are significant and fairly strong in all groups. The administration effects appear to be more similar within question groups where not only the set of response options but also the questions themselves are alike. The most consistent administration effect found in this study is that telephone respondents are more likely than web respondents to select an extreme response. This is consistent with previous research (Bronner and Kuijlen, 2007; Christian, Dillman and Smyth, 2005; Wichers and Zengerink, 2006). Respondents probably experience more time pressure on the phone and may use the extremes of a 5-point scale as if it were a yes/no scale. Primacy (a tendency to select the first-heard response option) and recency effects may also contribute to the extremity effect.
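The extremity effect described above can also be checked descriptively, before any modelling, by comparing the share of endpoint responses (1 or 5 on the five-point scale) between the two modes. The counts below are invented for illustration; only the group totals echo the sample sizes quoted earlier, and this is not ITC data.

```python
# Sketch of a direct check for the extremity effect: compare the share of
# endpoint responses (1 or 5 on a 5-point scale) between telephone and web.
# The cell counts are hypothetical, not ITC data.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: telephone, web; columns: extreme (1 or 5), non-extreme (2-4)
table = np.array([[120, 227],     # telephone: hypothetical counts
                  [310, 1073]])   # web: hypothetical counts

chi2, p, dof, expected = chi2_contingency(table)
phone_extreme = table[0, 0] / table[0].sum()
web_extreme = table[1, 0] / table[1].sum()
print(f"telephone extreme share: {phone_extreme:.3f}")
print(f"web extreme share:       {web_extreme:.3f}")
print(f"chi-square p-value:      {p:.4f}")
```

A descriptive check of this kind does not separate administration from selection effects; the model in the paper is needed for that.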
60If other kinds of administration effects are of interest, other ways of parameterizing the model might be considered. However, to separate the sources of administration effects, the study would need to include questions and response options designed for this purpose; the ITC Netherlands questionnaire was not formulated with this aim in mind.
61We might expect the estimated models to have different characteristics, because of cognitive processing differences, for questions which are four-scale (with no middle option), three-scale or binary. This could be a subject of future research. Note that if a question has a binary response, there is just one threshold point and it is not possible to identify both administration effect parameters; for binary responses we could construct a simpler model, with terms for propensity and a mode indicator.
62A limitation of our model is that it includes only respondents who selected one of the five response options, so it cannot describe an important mode effect, namely that web respondents tend to use the “don’t know” option more than telephone respondents (Bronner and Kuijlen, 2007; Roster, Rogers and Albaum, 2004; Wichers and Zengerink, 2006). This happens because web respondents see the “don’t know” option listed on their computer screen, while telephone interviewers do not say that there is a “don’t know” option. An earlier study showed that this was also the case for the ITC Netherlands Survey (Nagelhout et al., 2010).
63Section IV provides one example of how the model presented here can be used in the analysis of data from the ITC surveys and other multi-country surveys: comparing response distributions across countries. Similarly, we can compare response distributions within a dual frame design from one wave to the next, by dropping the country term and adding a term for wave. Note that the dependence from wave to wave of an individual’s responses may be captured in the individual random effect term b_{0u}. An analysis comparing changes in distributions over time among several countries can be carried out by adding terms for country, wave, and wave by country.
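The term structure described above can be made concrete with a model formula. The sketch below only builds the design matrix for country, wave and wave-by-country terms; the data frame is invented and the column names are generated by the patsy formula machinery, not taken from the paper.

```python
# Sketch of the design described above: terms for country, wave, and their
# interaction, expressed as a formula. Data and column names are illustrative.
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({
    "country": ["NL", "NL", "DE", "DE", "UK", "UK"],
    "wave":    [1, 2, 1, 2, 1, 2],
})
# "C(country) * C(wave)" expands to country + wave + country-by-wave terms
X = dmatrix("C(country) * C(wave)", df, return_type="dataframe")
print(X.columns.tolist())
```

Dropping the country term and keeping only the wave term corresponds to the single-country, wave-to-wave comparison mentioned first.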
64In summary, we believe the modelling approach presented in this article provides a natural and useful framework for accounting for mode of interviewing in mixed mode surveys. This is a relatively new topic. Mode effects are often tested for, but are only beginning to be incorporated in models.
Acknowledgements
The authors would like to thank the reviewers for their valuable comments and suggestions that helped improve the quality of the manuscript.

Frequency tables for selected questions (CAWI and CATI modes)
“You spend a lot of time thinking about how what you do today will affect your life in the future”
“In the last six months, how often have you felt difficulties were piling up so high that you could not overcome them?”
“In the last six months, how often have you felt that you were unable to control the important things in your life?”
65See website: http://www.itcproject.org.
Notes

[1]
Note that if the compositions of the telephone and web samples were the same with respect to X and W, there would be no differential selection bias in those variables, and the propensity to respond by telephone would be estimated as a constant, namely the telephone sample size divided by the total sample size.

[2]
If we had used web sample design weights which compensated for the oversampling of younger age groups, the dependence of propensity on age group would not have been significant. However, it is actually useful for our illustrative purpose to have the propensity score associated with variables that might influence the responses.

[3]
The corresponding frequency table can be viewed in Appendix 16 of the document http://www.itcproject.org/documents/researchmethods/appendixfrequencytablespdf.

[4]
Consistent with Table 6, if the two samples are combined and the mode effects are not incorporated in the model, the Netherlands and Germany are separated in the label salience ranking, with coefficient estimate –0.87 and p-value 0.0004.