Population (english edition)
I.N.E.D

ISBN: none
200 pages

pp. 795-830
DOI: pending


Volume 59 2004/6

2004 Population

Exploring explanatory models

An event history application

Xavier Bry [*] (LISE-CEREMADE, Université Paris IX-Dauphine), Philippe Antoine [**]
This article presents an empirical coupling ("plugging") of factor analysis with generalized linear regression (logistic regression, Cox models, etc.). We show that this combination can facilitate the exploration of complex data such as event histories (time-varying, censored) for modelling purposes. By combining a regression method with a new type of factor analysis, thematic components analysis, we show how an explanatory conceptual model for the data can be included from the start of the exploratory phase. The method is then applied to an analysis of the divorce behaviour of men in Dakar, which gives a simple illustration of each methodological point discussed.
Exploration and description, followed by analysis, are the objectives common to all empirical social research. The two types of method generally used for this purpose are factor analysis and generalized linear regression. Although these techniques are complementary, in practice it is rare for them to be used conjointly. In this article Xavier Bry and Philippe Antoine present an original approach that draws on the respective qualities of both methods. They then apply it in an analysis of divorce behaviour among men in Dakar based on the numerous characteristics available in a small-size sample. Thematic components analysis (TCA) synthesizes the redundant explanatory variables into a limited number of factors relevant to the initial research problem and makes possible parsimonious linear modelling.
Factor analysis and econometric techniques have the reputation of being like oil and water: invaluable for cooking but not easy to combine. The classic methods of factor analysis (e.g. PCA, MCA) are of course powerful tools for dimension reduction (synthesizing heterogeneity into a small number of factors), but they preclude any pre-established explanatory schema and are unsuitable for exploring cause and effect. Two of their characteristics are hard to reconcile with explanatory modelling: first, they relate variables only through pairwise relations; second, they treat the variables with a high degree of symmetry. Bivariate relations cannot be used to measure the partial effect of one variable on another, i.e. the effect that remains after the influence of other determinants has been netted out. Another particularity of these factor analytic methods is that they do not order the observations sequentially and are thus unsuitable for studying dynamic processes. When observations are dated, studying a dynamic process usually requires that future outcomes be modelled as a function of conditions in the past [1], which necessarily involves sequentially ordering the observations [2].
For their part, econometric techniques, which are based on conditional models, study partial relations and are thus entirely suitable for explanatory analysis. But they must use parsimonious models if they are to avoid the multicollinearity problem caused by excessive redundancy in the explanatory variables, and produce stable estimates [3]. Very often, therefore, a preliminary stage of dimension reduction is needed.
Thus it can be seen how these techniques are complementary and why in practice their sequencing is strict. Factor analysis is employed first, purely for exploratory purposes, in order to extract a limited number of strong dimensions (or factors) from the data. In a second stage, these dimensions are introduced into an econometric model that is underpinned by an explanatory schema [4].
Unfortunately, this sequence cannot always be operationalized. First, as the variables selected in the dimension-reduction phase are calculated without reference to any explanatory schema, there is no certainty that they are the most relevant for the subsequent modelling. Second, factor analysis is seriously handicapped by censored observations, whereas modelling often permits a rigorous handling of this problem. For these two reasons an explanatory model needs to be included from the start of analysis.
The response to this situation has been the development of a new method of factor analysis, thematic components analysis (TCA) (Bry, 2003), which puts the explanatory model at the source of the dimension reduction. This method is a generalization of the partial least squares (PLS) regression developed by Wold (Wold, 1985). By construction it is adapted to classic linear modelling of continuous variables when working with non-temporal data. In this article we present a way of “plugging” it into generalized linear models, and in particular with Cox’s semi-parametric model. We begin by setting out this methodological approach, and then apply it to original data derived from a recent African survey, in an analysis of the divorce behaviour of men in Dakar.
 
I. Modelling based on latent variables
 
 
Econometric modelling is always founded on a conceptual schema. This schema is the synthesis of a substantive theoretical reflection that can alone provide the underpinning for its explanatory character. The conceptual model is often presented in the form of an oriented graph where the vertices represent various concepts or themes, which serve to characterize the observations, while the edges stand for the relations of cause-and-effect or more generally of influence between these concepts. We refer to this schema as a thematic model.
For example, to model the risk of divorce for men, we propose the thematic model shown in Figure 1. This division by themes corresponds to the research problem developed as an application in part IV of this article.
Figure 1
Thematic model for analysing the risk of divorce
In the great majority of situations, the explanatory dimensions, even if they are conceptually clear, remain “dense” and indeterminate from the standpoint of observation because they involve many characteristics for which multiple measurements may exist. In the example above, educational level is measured for the respondent and for the respondent’s mother and father. Cultural factors (ethnic group, religion, etc.) influence educational level but also the characteristics of the union. The couple’s economic situation is characterized in terms of occupation and of housing, etc. And lastly, the characteristics of the union affect a priori the outcome of the union.
For each of the themes included in the model, it is necessary to isolate the small number of dimensions needed for efficient econometric modelling, in other words interpretative dimensions that are clear and that produce a model giving a good fit to the observations.
We consider that a theme has several conceptual dimensions (for example, the cultural factors have an urban/rural dimension, an ethnic dimension, a religious dimension). Traditional practice is to select a single observed variable per conceptual dimension, in order to avoid redundancy in the model that will destabilize estimation. But the selected variable merely represents this conceptual dimension, which is often measurable in several different ways. This variable therefore serves as a proxy [5] for a latent variable that does measure the conceptual dimension correctly but remains unobserved. In practice, there is a major difficulty over choice of the “best” proxy: it is important that it have a high “representativeness” at the conceptual level but we also want to find the one with the greatest predictive power. Now this predictive power depends on the other explanatory variables introduced into the model. Thus we have to deal with a combinatorial problem.
A completely different strategy can be adopted. This bases the model on latent variables (unobserved) that will be estimated using the correlations between the observed variables that contribute to describe a single conceptual dimension. In this approach the redundancy of the observed variables around a single conceptual dimension is not a handicap but an advantage. Each latent variable underlying a group of observed variables is thus assumed to satisfy the following double constraint:
  • of correlating overall with the observed variables of this group;
  • of correlating with the other latent variables in conformity with the hypotheses of the conceptual model.
Introducing this double constraint allows us to elaborate a strategy for estimating the latent variables. A latent variable will be estimated by a factor that optimizes a particular criterion. The latent variable and the factor that estimates it will be denoted by the same letter, F.
The approach using latent variables has an advantage in terms of robustness that is especially marked when working with a small number of observations. In such cases an anomalous observation can produce a large change in the estimated coefficients. Each proxy represents its latent variable with a degree of error. The smaller the number of observations, therefore, the larger the impact of this error on estimation of the coefficients. Interpretation of the results then rests on an assumption that may be invalid, namely that the estimated coefficients accurately express the effect of these latent variables.
If, on the other hand, we build the model using stabilized estimates of latent variables based on several observed variables, we improve the robustness of the estimated effects.
For the sake of simplicity, in what follows we consider only models with a single observed dependent variable y to be explained. In consequence we are concerned with estimating the latent explanatory variables of y.
 
II. Estimating latent variables using factor analytic methods
 
 
After briefly reviewing the most traditional types of factor analysis (PCA, MCA), in which the latent variables are estimated without reference to any explanatory schema, we present two more recent factor analytic methods: PLS regression, which uses a simplified explanatory model; and thematic components analysis, which extends PLS regression to the full thematic model.
1. With no thematic model: PCA, MCA
a. Calculating the first principal component
We consider a group X of J observed numerical variables x1, …, xJ that measure the different aspects of a single theme. We first assume that all the variables xj of the group are determined, subject to an error term ej, by a single latent variable F that we wish to identify (Figure 2).
Figure 2
Conceptual schema of PCA
Minimizing the sum of squared residuals ej leads to identifying a factor F that is called the first principal component of X. This method has been extended to the more general case where set X is weighted by a metric M (square symmetric positive matrix of dimension J) as indicated in Box 1.
Box 1: Estimating the first principal component
Let u be a vector of dimension J that is standardized with respect to M, i.e. that satisfies u′Mu = 1. Consider the following maximization programme:
$$\max_{u' M u = 1} \; \lVert X M u \rVert$$
Classically it is shown that the solution XMu of this programme is precisely the first principal component F.
Choice of an appropriate metric M allows categorical variables to be processed.
Let X be a group of R categorical variables. We code each variable by the set of dummy variables corresponding to its values. We denote by Xr both the rth variable and the corresponding group of dummy variables. The group X is formed from the juxtaposition of these groups of dummy variables: X = (X1, …, XR). It is then processed by means of the metric M = Diag({(Xr′Xr)⁻¹}, r = 1 to R). PCA using this metric gives multiple correspondence analysis (Lebart et al., 1995; Bry, 1994).
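The programme of Box 1 can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function name `first_principal_component`, the simulated data and the use of NumPy are assumptions, and M is assumed to be symmetric positive definite so that the change of variable v = M^(1/2)u reduces the programme to an ordinary eigenproblem.

```python
import numpy as np

def first_principal_component(X, M=None):
    """Solve max ||X M u|| subject to u' M u = 1; return (F, u) with F = X M u."""
    X = np.asarray(X, dtype=float)
    M = np.eye(X.shape[1]) if M is None else np.asarray(M, dtype=float)
    # Symmetric square root of M (M is assumed symmetric positive definite).
    w, V = np.linalg.eigh(M)
    M_half = (V * np.sqrt(w)) @ V.T
    M_half_inv = (V / np.sqrt(w)) @ V.T
    # With v = M^(1/2) u the programme becomes: maximize v' S v with ||v|| = 1.
    S = M_half @ X.T @ X @ M_half
    _, vecs = np.linalg.eigh(S)          # eigenvalues come in ascending order
    v = vecs[:, -1]                      # eigenvector of the largest eigenvalue
    u = M_half_inv @ v
    return X @ M @ u, u

# Simulated group: four noisy measurements of a single dimension (illustrative data).
rng = np.random.default_rng(0)
common = rng.normal(size=(100, 1))
X = common + 0.3 * rng.normal(size=(100, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardized variables, so M = I

F, u = first_principal_component(X)

# Same programme with a non-identity diagonal metric (arbitrary weights).
M = np.diag([1.0, 2.0, 3.0, 4.0])
F2, u2 = first_principal_component(X, M)
```

The returned F is an eigenvector of XMX′ associated with its largest eigenvalue, which is one way of checking the result.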
Beyond the first component?
Once the first factor has been derived, we can look for a second that meets the constraint of being orthogonal to the first. The procedure is repeated until the complete PCA of X has been obtained. The first factor estimates the latent variable of a model that assumes it to be unique. It is possible to stop at the first factor only if group X is essentially one-dimensional, i.e. composed of variables that all measure, with only small differences, the same dimension. Such a situation is not very common. By and large, group X is structured around several dimensions, and it is important to identify these to avoid misrepresenting the data set. But in general the strong dimensions of X will not be pairwise uncorrelated. We look for uncorrelated factors because this simplifies some of the calculations and the graphing of the correlations between variables. So these factors cannot systematically be taken for realistic estimations of the latent variables. They are primarily a tool for visualizing the structure of X on a reduced number of dimensions, which is indispensable for its exploration. This relaxation is necessary for all the factor analytic methods that calculate several factors by groups.
When we calculate several factors by groups, they are written F1, …, Fα, …
b. Interpreting the factors
The factors that estimate the latent variables are interpretable from their correlations with the observed variables. It is convenient to make a graphical representation of the observed variables in the factor base, each variable xj having as coordinate on the Fα factor axis its correlation ρ (xj,Fα) with this factor (cf. Figure 3).
Figure 3
Factor representation of the variables of group X
We look for the variables having the highest (positive or negative) correlation with each factor to determine its meaning. The plane defined by two factors is sometimes easier to interpret than the factors in isolation [6]. For each factor plane it is important to examine all the factors that are strongly represented in it. The details of the rules for interpreting a PCA are given in Lebart et al. (1995) and Bry (1994).
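The coordinates used in such a representation are simple correlations, so they can be computed in a few lines. The sketch below is illustrative only: the function name `variable_factor_correlations` and the simulated data are assumptions, and the factors are obtained here from a singular value decomposition of the standardized group, which is one common way of computing a PCA.

```python
import numpy as np

def variable_factor_correlations(X, factors):
    """Correlations between the columns of X and the columns of `factors`.

    Returns an array of shape (J, A): row j holds the coordinates of
    variable x_j on the factor axes F_1, ..., F_A.
    """
    Xc = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Fc = np.asarray(factors, dtype=float) - np.mean(factors, axis=0)
    Xc = Xc / np.linalg.norm(Xc, axis=0)
    Fc = Fc / np.linalg.norm(Fc, axis=0)
    return Xc.T @ Fc

# Illustrative data: three variables, two of them correlated.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X[:, 1] += X[:, 0]
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
components = U * s                       # all principal components of Xs

coords = variable_factor_correlations(Xs, components[:, :2])   # first factor plane
full = variable_factor_correlations(Xs, components)            # all factors
```

When all the components are kept, the squared correlations of each variable sum to 1, so the distance of a variable from the unit circle in a factor plane indicates how well that plane represents it.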
The conceptual model of PCA is too simple to translate the causalities underlying latent variables.
2. With a monothematic model: PLS regression
a. Model and estimation
Let us take again the PCA model, making latent variable F explanatory of an observed variable y (Figure 4). The group X is weighted by metric M.
Figure 4
Conceptual schema of PLS regression
F is estimated by solving a maximization programme that integrates the relation between F and X and that between F and y (see Box 2).
Box 2: Estimating F in a PLS regression
Let u be a vector standardized with respect to M. We let F = XMu and solve the following programme:
$$Q: \quad \max_{u' M u = 1} \; \langle X M u \mid y \rangle$$
The maximized criterion is a composite, for ⟨XMu|y⟩ = ‖XMu‖ ‖y‖ cos(XMu, y), where ‖y‖ is a constant.
Now, as we have seen, maximization of ‖XMu‖ on its own leads to the PCA of X, while that of the cosine of XMu and y leads to regression of y on X.
This new programme thus results in a compromise between PCA of X (fitting of the latent variable to set X) and regression of y on X (estimation of y from the latent variable) (cf. Tenenhaus, 1998; Bry, 2001b). These relations between on the one hand F and X, and between F and y on the other, are represented on Figure 5.
Figure 5
The initial PLS programme
Solving the maximization programme Q gives a factor F proportional to XMX′y. We denote this latter quantity RX,My, which has been named the resultant of y on the group X weighted by M.
Its properties are studied in Bry (2001b and 2004). The essential property is that when we apply it to some variable z, the operator of resultant RX,M = XMX′ draws z towards the strongest structures of X (as embodied by its first principal components).
It is important to note that with ŷ designating the orthogonal projection (usual regression) of y on X, we have: RX,My = RX,Mŷ. In consequence, proceeding heuristically, we can conceive calculation of the first factor of PLS as the succession of the following two operations: 1) regression of y on X (optimizing the estimate); 2) calculation of the resultant drawing the estimated part ŷ towards the strong structures of X.
Solution of the programme Q produces a single factor. If we want to organize X around several factors, we rerun the programme under the constraint of orthogonality between the new factors to be found and those already extracted.
When X consists of numerical variables, they will be standardized and we use the metric M = I. When X consists of categorical variables, these will be coded by the dummy variables of their modalities and we use the metric of the MCA (cf. Box 1). It is perfectly possible to process a mixed group (containing both numerical and categorical variables) by using a block-diagonal matrix M, in which the diagonal element mjj corresponding to a numerical variable xj is equal to 1, and the diagonal block corresponding to a categorical variable Xr is equal to (Xr′Xr)⁻¹. The choice of metrics is discussed in Cazes (1997), Tenenhaus (1999) and Bry (2001b).
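The resultant and the first PLS factor can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `resultant` and `first_pls_factor`, the simulated data and the scaling of the factor to unit norm are choices made here, not taken from the article. The example also checks numerically that the resultant of y and that of its regression ŷ on X coincide.

```python
import numpy as np

def resultant(X, y, M=None):
    """R_{X,M} y = X M X' y: the resultant of y on the group X weighted by M."""
    X = np.asarray(X, dtype=float)
    M = np.eye(X.shape[1]) if M is None else np.asarray(M, dtype=float)
    return X @ M @ (X.T @ y)

def first_pls_factor(X, y, M=None):
    """First PLS factor: the resultant of y on X, scaled to unit norm (a choice made here)."""
    r = resultant(X, y, M)
    return r / np.linalg.norm(r)

# Illustrative data: five standardized numerical variables, so M = I.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=80)
y = y - y.mean()

# Operation 1: ordinary regression of y on X.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# Operation 2: the resultant draws the fitted part towards the strong structures of X.
F = first_pls_factor(X, y_hat)

# Direction u that maximizes <Xu|y> under u'u = 1 (case M = I).
u_star = X.T @ y / np.linalg.norm(X.T @ y)
```

Because X′y = X′ŷ, calling `first_pls_factor(X, y)` directly gives the same factor, so the regression step is a way of reading the result rather than a computational necessity.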
b. Interpreting the results
The factors of X are interpreted in the same way as for a PCA (we give the same representation of the variables of X in its factor base). Visualizing the correlations between observed variables provides an essential indication on the degree of realism of the “all things being equal” reasoning when interpreting the estimated model. This visualization is also essential for selecting the explanatory dimensions that will finally be retained: it enables them to be sorted and interpreted in terms of relations with the observed variables.
In addition, because the factors estimate the latent explanatory variables of y, they can be considered as an intermediary of calculation for elaborating a formula for estimating y from the observed variables xj. Each factor is written as a linear combination of the variables of its group, so it is tempting to interpret the coefficient of a variable in this combination as measuring the role played by this variable in the formation of the factor. Later, by regressing y on the factors, we will seek the share of each factor in the formation of y. By combining the two, it is straightforward to reconstruct a formula that estimates y from the variables xj.
Interpreting the factors from the coefficients of the variables that contribute to their formation, though common, is problematic when the variables are affected by multicollinearity or are close together. The coefficients are then unstable, as the effect of some variables can transfer onto others. Although the mode of calculating the factors under PLS limits the confusion of effects (De Jong, 1995), we consider interpretation based on the variable-factor correlations preferable to that based on the coefficients.
The observations can be represented on the factor planes. This is useful in particular for detecting anomalous observations and clusters of observations. By visualizing the distribution of the scatter of observations one can see whether an effect is indeed a general pattern in the data or simply the product of some anomalous observations, thus completely changing its interpretation.
The dependent variable is regressed on the factors; being uncorrelated they allow additive decomposition of the variance of y and elimination of the least important factors.
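The additive decomposition can be illustrated with a short sketch. The function name `variance_shares` and the data are assumptions; the uncorrelated factors used in the example are stand-ins built from a singular value decomposition, not factors produced by an actual PLS run.

```python
import numpy as np

def variance_shares(factors, y):
    """Share of the variance of y attached to each of several uncorrelated factors."""
    F = np.asarray(factors, dtype=float)
    F = F - F.mean(axis=0)
    yc = np.asarray(y, dtype=float) - np.mean(y)
    # With uncorrelated factors, each share is a squared simple correlation.
    return (F.T @ yc) ** 2 / ((F ** 2).sum(axis=0) * (yc @ yc))

# Stand-in factors: three mutually uncorrelated, centred columns.
rng = np.random.default_rng(4)
X = rng.normal(size=(70, 4))
Xc = X - X.mean(axis=0)
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
factors = U[:, :3] * s[:3]
y = Xc @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(size=70)

shares = variance_shares(factors, y)
ranking = np.argsort(shares)[::-1]       # most important factor first

# R-squared of the multiple regression of y on the three factors, for comparison.
yc = y - y.mean()
coef, *_ = np.linalg.lstsq(factors, yc, rcond=None)
resid = yc - factors @ coef
r2 = 1.0 - resid @ resid / (yc @ yc)
```

The shares sum to the R² of the multiple regression, which is what allows the least important factors to be dropped without altering the contribution of the others.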
c. Interest and limitations of the model
The interest of the factors obtained from PLS regression, compared with those from PCA, is that, even though they are related to the strong structures of X, they are a priori much more efficient for estimating y. Compared with classic OLS regression, PLS regression has an advantage in that, being based on the strong structures of X rather than on all its dimensions, it eliminates from the model its most fragile part. This makes estimation more robust and the model easier to interpret. There results a slight loss of quality of fit that one can try to keep as small as possible by including more factors in the model. However, in so far as a proportion of the fit is merely due to noise, the improvement obtained is misleading, and it is better to eliminate the corresponding dimensions. From a practical point of view, PLS regression greatly facilitates analysis of the determinants of y by presenting a hierarchy of non-redundant factors that permits visualization of the correlation structures of the explanatory group X.
PLS regression thus has major advantages compared with its OLS counterpart. However, it is no longer suitable once the conceptual model contains several explanatory themes. If, when using PLS regression, we decide to include all the explanatory themes, it will generate hybrid factors that are all the harder to interpret because they mix together variables that are conceptually heterogeneous.
3. With a multi-thematic model: thematic components analysis
a. Presentation
Here we consider that a phenomenon described by an observed variable y has as determinants R explanatory groups X1, …, Xr, … XR corresponding to the same number of themes (Figure 6). The variables in groups X1, …, XR are together written X. Each group Xr is weighted by a metric Mr. To simplify, we begin by considering that each group Xr is structured around a single latent variable Fr.
Figure 6
Conceptual schema of TCA1: multi-thematic modelling of a variable y
When we are looking for a factor Fr that represents the group Xr in its relationship with y, we have to allow for the existence of other explanatory factors of y. Before applying a programme of type Q between y and Xr, the influence of the other common factors must be eliminated. This is what thematic components analysis does. The general construction of TCA and its properties are set out in Bry (2003). In the most general case, we have a group Y of variables yk to be explained. When, as in the present case, the group to be explained contains a single observed variable y, the method is written TCA1, and its algorithm is simpler.
b. The TCA1 algorithm
Step 1 (calculation of rank 1 factors)
Iteration 0 (initialization):
For the initial value of each factor Fr we take the resultant of y on Xr, as defined in section II-2-a.
Iteration k, k>0 (Figure 7):
Figure 7
Step 1 in TCA1, calculation of Fr(k) in current iteration k
For r going from 1 to R, we denote F-r (k–1) the group of factors obtained at step k–1 that does not include Fr, and we proceed as follows for calculating Fr (k):
– We regress y on {Xr,F-r (k–1)}. We write ŷr for the component of the estimate ŷ along the subspace <Xr>.
This component is the projection of ŷ on <Xr> parallel to <F-r (k–1)>. It yields the best estimate of y from group Xr and from the factors obtained in the other groups.
– We set Fr(k) = XrMrXr′ŷr, standardized. Calculating this resultant attracts ŷr towards the strong structures of Xr.
End: We stop when the results are judged to be sufficiently stable.
Step n (calculation of the rank n factors), n>1:
Each group Xr is replaced by its regression residuals on its rank 1, …, n–1 factors. We then perform the same calculations as in step 1, with one modification: the factors obtained in the previous steps have to be taken into account, so each of them is treated in the current step as a group in its own right (and is thus equal to the latent variable of this “group”).
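Step 1 of the algorithm can be sketched as follows. This is a reading of the description above under stated assumptions, not the authors' code: the function name `tca1_rank1`, the stopping rule (largest change in a factor below `tol`), the use of least squares via `numpy.linalg.lstsq` and the simulated data are all choices made for the illustration. Only the rank 1 factors are computed.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def tca1_rank1(groups, y, metrics=None, tol=1e-8, max_iter=200):
    """Rank-1 factors F_1, ..., F_R of TCA1 for centred groups X_r and centred y."""
    y = np.asarray(y, dtype=float)
    groups = [np.asarray(X, dtype=float) for X in groups]
    if metrics is None:
        metrics = [np.eye(X.shape[1]) for X in groups]
    ops = [X @ M @ X.T for X, M in zip(groups, metrics)]   # resultant operators
    # Iteration 0: each factor starts as the resultant of y on its group.
    F = [_unit(R @ y) for R in ops]
    for _ in range(max_iter):
        F_new = []
        for r, X in enumerate(groups):
            others = [F[s] for s in range(len(groups)) if s != r]
            # Regress y on {X_r, factors of the other groups from the previous iteration}.
            Z = np.column_stack([X] + others)
            coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
            y_hat_r = X @ coef[:X.shape[1]]                # component along <X_r>
            F_new.append(_unit(ops[r] @ y_hat_r))          # resultant, standardized
        change = max(np.linalg.norm(a - b) for a, b in zip(F_new, F))
        F = F_new
        if change < tol:                                   # results judged stable
            break
    return F

def _z(A):
    return (A - A.mean(axis=0)) / A.std(axis=0)

# Illustrative data: two themes, each built around one latent dimension.
rng = np.random.default_rng(3)
n = 60
t1, t2 = rng.normal(size=(2, n))
X1 = _z(t1[:, None] + 0.4 * rng.normal(size=(n, 3)))
X2 = _z(t2[:, None] + 0.4 * rng.normal(size=(n, 4)))
y = t1 - t2 + 0.5 * rng.normal(size=n)
y = y - y.mean()

F1, F2 = tca1_rank1([X1, X2], y)
```

With a single group the loop reduces to the first factor of PLS regression, which gives a simple consistency check. Convergence is not guaranteed by this sketch, hence the cap on the number of iterations.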
c. Interpreting the results
Each thematic group Xr is represented in its factor base, as in the case of the PLS method. The interpretation of the factors is done under the same rules.
One can also represent the scatter of observations in the thematic planes and use it in the same way as in PLS regression.
The dependent variable y is regressed on the selected factors to identify those that play a role in its estimation. After eliminating the least important factors, a final regression is performed to estimate the latent model.
d. Advantages of the method
This generalization of PLS regression respects the thematic division established by the multi-thematic conceptual model. It permits exploration of each theme in the framework of the conceptual model specified at the outset, starting with the dimensions the most useful for estimating y.
It is important to understand that, just as the choice of explanatory variables in a classic regression completely determines the interpretation of the effects (and of course their estimation), so choice of the division by themes strongly determines the results of TCA. There is nothing unusual in this, since when we change the conceptual model we change the perspective on the data. Is this sensitivity to the initial model a shortcoming? In our view it is exactly the opposite. It must be remembered that every statistical method is founded on a conceptual model (the selection of variables is itself one). When this model is not clearly apparent, it is nonetheless implicit, and its invisibility increases the risk of bias in the conclusions [7]. TCA forces analysts to specify their model from the outset and, consequently, to make a case for it. The thematic division, rarely self-evident, requires them to make explicit reference to a theory, which in return can alone provide them with interpretational keys. Thus it seems to us that TCA has a rationalizing function for practices at the epistemological level.
By making it possible to visualize each of the themes on the rank-ordered factors, TCA facilitates the selection of relevant predictor variables. In this respect TCA differs radically from automatic predictor selection methods. The latter: 1) make no conceptual distinction between the predictors; 2) select from among highly redundant variables, and for the sake of small improvements on the fitting criterion may exclude the most pertinent variable in favour of one far less pertinent; 3) take the decision-making role away from the analyst. By respecting the conceptual model and by reducing the dimension of the problem without excluding any variable from the graphical representations, TCA puts the analyst back at the centre of decision-making.
Unfortunately, the TCA presented above cannot be applied directly to complex data like those on life histories, which are characterized by temporal variations and censored observation. We have to introduce an intermediate step.
 
III. Estimating a generalized linear model
 
 
1. The model
Here we consider the case of an observed dependent variable y that is not continuous, to be explained by means of R explanatory groups X1, …, Xr, … XR. To simplify, each group Xr is initially assumed to be structured around a single latent variable Fr. Direct linear modelling of y as a function of the continuous latent variables Fr is not suitable. Thus we shall use a generalized linear modelling of y as a function of X. We thus assume that y obeys a law Pθ, where θ is a parameter of form g (Xb) and g is a known mapping. The coefficient-vector b being unknown, the variable W = Xb used by this model is unobserved and hence latent. Generalized linear regression classically estimates W by maximum likelihood of the model. We assume lastly that W is itself partially a function of the latent variables Fr of the Xr embodying the strong structures of these groups. The conceptual model that we use is shown as a schema in Figure 8.
Figure 8
Conceptual schema of plugging between TCA and generalized linear modelling
2. Method of estimation
If we do not wish to obtain confidence intervals or to test hypotheses on the effects, we can make do with an empirical estimation. Otherwise, a number of modifications have to be made to ensure the analysis is correct. It is crucial that empirically estimated factors not be used as exogenous variables for a model of y that we wish to estimate by maximum likelihood: because the endogenous variable y appears in the calculation of these factors, they are no longer exogenous, and any inferential method that treats them as such is invalid [8].
a. Empirical approach
The steps are as follows:
  1. We estimate latent variable W by maximum likelihood of the model explaining y as a function of X.
  2. We then estimate the Fr using W as dependent variable in the TCA1.
  3. We determine the number p of useful explanatory factors. This can be done by examining the proportion of W’s variance explained by the selected factors.
  4. We interpret the factors.
Comments:
  • For the maximum likelihood procedure in step 1, we use all the available predictor variables. In this way we make best use of their predictive potential: the subspace they span is used in full, and all its dimensions contribute with a priori equal importance, whether they are structurally strong dimensions or residual dimensions.
  • If we include all the factors, at step 3, we get the maximum likelihood estimator of W initially calculated.
  • In the particular case where y is a continuous variable following a normal distribution as a function of X, the procedure we propose here is identical to TCA1. Estimating W (first step) produces the regression ŷ of y on X. Now from examining its algorithm it is easy to see that the TCA1 of ŷ (next step) is equivalent to that of y.
  • Another extension of the TCA1 to a generalized linear model is possible. This simply involves, in the current step, replacing the regression of y on {Xr,F-r (k –1)} by its generalized regression (logistic, Cox model, …). The component ŷr is then equal to the term Xrbr obtained in this regression, i.e. the part of the estimated linear predictor carried by Xr. If this extension appears to be more direct, this is because the latent variable W is made implicit. However, this method takes longer to compute, since it has to maximize the likelihood at each iteration.
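Steps 1 to 3 of the empirical approach can be sketched as follows for a binary y. Everything specific in this sketch is an assumption made for illustration: the logistic model stands in for the generalized linear model, it is fitted by Newton-Raphson, the metrics are identity matrices, only rank 1 factors are computed, the data are simulated, and the function names (`fit_logistic`, `tca1_rank1`, `explained_share`) are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=100, tol=1e-10):
    """Newton-Raphson maximum likelihood for a logistic model with intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ b)))
        hessian = Z.T @ (Z * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, Z.T @ (y - p))
        b += step
        if np.max(np.abs(step)) < tol:
            break
    return b

def unit(v):
    return v / np.linalg.norm(v)

def tca1_rank1(groups, w, n_iter=100, tol=1e-8):
    """Rank-1 TCA1 factors with identity metrics, returned as columns."""
    F = [unit(X @ X.T @ w) for X in groups]
    for _ in range(n_iter):
        new = []
        for r, X in enumerate(groups):
            others = [F[s] for s in range(len(groups)) if s != r]
            coef, *_ = np.linalg.lstsq(np.column_stack([X] + others), w, rcond=None)
            new.append(unit(X @ X.T @ (X @ coef[:X.shape[1]])))
        change = max(np.linalg.norm(a - c) for a, c in zip(new, F))
        F = new
        if change < tol:
            break
    return np.column_stack(F)

def explained_share(F, w):
    """Share of the variance of the centred variable w reproduced by the factors F."""
    coef, *_ = np.linalg.lstsq(F, w, rcond=None)
    resid = w - F @ coef
    return 1.0 - resid @ resid / (w @ w)

# Simulated data: two themes of three standardized variables each, binary outcome.
rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=(n, 2))
standardize = lambda A: (A - A.mean(axis=0)) / A.std(axis=0)
X1 = standardize(latent[:, [0]] + 0.5 * rng.normal(size=(n, 3)))
X2 = standardize(latent[:, [1]] + 0.5 * rng.normal(size=(n, 3)))
X = np.column_stack([X1, X2])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(latent[:, 0] - latent[:, 1])))).astype(float)

b = fit_logistic(X, y)               # step 1: maximum likelihood on all predictors
W = X @ b[1:]                        # estimated latent variable (centred part of Xb)
F = tca1_rank1([X1, X2], W)          # step 2: TCA1 with W as dependent variable
share = explained_share(F, W)        # step 3: variance of W explained by the factors
```

The share computed in step 3 is what guides the choice of the number p of factors; with higher-rank factors added it would rise towards 1, in line with the second comment above.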
b. Inferential approach
A model can be constructed that permits inference (calculation of confidence intervals and tests) after step 4. To specify a model based on a selection of observed variables, we proceed by the following steps:
  1. We select a subset of the original explanatory variables representative of all the factors, in the sense that these variables are both correlated with the factors and illustrative of the substantive interpretation that has been given for them.
  2. We estimate by maximum likelihood the model limited to the selected explanatory variables.
To specify a model based on the latent variables, we proceed by the following steps:
  1. For each factor (or each major direction of an explanatory plane) we select a subset of original explanatory variables that are strongly correlated with it and illustrative of the substantive interpretation that has been given for them.
  2. We perform the PCA of each of these subsets separately, and we retain the first principal component as being that which estimates the latent variable underlying each subset.
  3. We estimate by maximum likelihood the model based on these principal components. Although their mode of selection introduces a small amount of endogeneity into the chosen observed variables, calculation of the principal components does not introduce y. In consequence, inference based on this model can be considered as legitimate.
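The extraction of a first principal component from a subset of variables (step 2 above) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the power-iteration approach and the toy data are ours, not the authors' STATA program, and the sketch assumes that no variable in the subset is constant.

```python
import math

def first_principal_component(rows, n_iter=500):
    """Return (loadings, scores) of the first principal component.

    rows: list of observations, each a list of numeric variables.
    Variables are centred and scaled to unit variance beforehand
    (assumption: no variable is constant).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    # Correlation matrix of the standardized variables.
    corr = [[sum(zi[a] * zi[b] for zi in z) / n for b in range(p)]
            for a in range(p)]
    # Power iteration: converges to the eigenvector of the largest eigenvalue
    # (an asymmetric start vector avoids the most obvious degenerate cases).
    v = [1.0 + j for j in range(p)]
    for _ in range(n_iter):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    if sum(v) < 0:  # fix the arbitrary sign of the eigenvector
        v = [-x for x in v]
    scores = [sum(zi[j] * v[j] for j in range(p)) for zi in z]
    return v, scores

# Invented toy subset: two perfectly correlated variables.
toy = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, 9.0]]
loadings, scores = first_principal_component(toy)
```

The scores play the role of the estimated latent variable for the subset. Because y is not used anywhere in the computation, the component can enter the final model as an ordinary covariate.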
3. Application to failure-time analysis
We now model the occurrence of an event E for an individual as a function of the characteristics of this individual.
a. Cox’s model
The risk of experiencing the event at time t is an exponential mapping of the characteristics xt of the individual at this time: h(t) = h0(t) exp(b’xt), where h0(t) is the baseline risk and b the vector of coefficients.
The likelihood of the trajectory of an individual for whom event E occurs at time tE is:
Consider now a sample of independent individuals. The index used to identify the individuals is i. The likelihood of the model applied to the sample is the product of the individual likelihoods:
From a formal viewpoint this is equivalent to a sample likelihood where the observations are not the individuals themselves but the pairs (individual, date of observation): (i, t). An individual is the subject of as many observations as there are dates, and the model above makes them formally independent, thus making it possible to work with time-varying characteristics. For each pair (i, t) there is a corresponding value for the determinants of the risk xit, hence a value for the latent variable W = b’xit and a value for the risk: h0(t) exp(b’xit). In the same way, the factors Fr that we calculate afterwards will have a value for each pair (i, t).
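The passage from individuals to pairs (i, t) can be illustrated by expanding one individual's episodes into one record per month at risk. This is a minimal sketch: the function name, the half-open month intervals and the example history are assumptions made for illustration and are not taken from the survey file.

```python
def expand_to_person_periods(episodes, event_time):
    """Expand one individual's episodes into one row per month at risk.

    episodes: list of (start, end, covariates) with half-open intervals
              [start, end), in months since the start of the union.
    event_time: month in which the event occurs, or None if censored.
    Returns a list of dicts with keys 't', 'event', plus the covariates.
    """
    rows = []
    for start, end, covariates in episodes:
        for t in range(start, end):
            row = {"t": t,
                   "event": int(event_time is not None and t == event_time)}
            row.update(covariates)  # covariates are constant within an episode
            rows.append(row)
    return rows

# Hypothetical history: monogamous for 24 months, then polygynous,
# with a divorce in month 35.
episodes = [(0, 24, {"polygame": 0}), (24, 36, {"polygame": 1})]
rows = expand_to_person_periods(episodes, event_time=35)
censored_rows = expand_to_person_periods(episodes, event_time=None)
```

Each row then carries its own value of xit, so W and the factors Fr can be evaluated row by row.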
b. TCA after Cox’s regression
We perform the Cox regression using all the available characteristics. The adjustment for losses to follow-up and the modelling of the dynamic are carried out at this stage. The Cox regression supplies an estimate of W written b’xit for individual i at time t. This estimate is based equally on the strong dimensions and on the residual dimensions of the subspace of explanatory variables.
We then perform the TCA of this estimate on the explanatory groups, to extract the latent explanatory variables. This TCA takes as observation the pairs (individuals, date of observation): (i, t).
We will now apply this method to an analysis of divorce behaviour among men in Dakar.
 
IV. Analysis of divorce in Dakar
 
 
1. Framework of analysis
In Senegal, as indeed in Africa more generally, almost nothing is known about the trend in the frequency of divorce or about its determinants (Kaufmann et al., 1988). The poor state of our knowledge of marital disruption in Africa is an obstacle to analysis of the evolution in the phenomenon over time (Smith et al., 1984; Hertrich and Locoh, 1999). The factor with the greatest influence on divorce appears to be women’s labour market participation. It is linked to the possibility of financial independence for women, enabling them to support themselves economically in the event of divorce (McDonald, 1985; Burnham, 1987).
Period data from survey or census sources underestimate the frequency of divorce. This is because censuses and surveys only ask for the marital status at the time of the investigation; in some cases we also know the number of unions an individual has contracted though without knowing if remarriage is consequent upon widowhood or divorce. In societies where polygyny is practised, studies concerning men focus more on the number of wives than on the number of divorces (Antoine et al., 1998). Union dissolutions are often masked by rapid remarriage.
In Senegal, for example, in 1986, the proportion of divorced women was 3.8% at ages 20-24 years and 5.4% at 35-39 years, with the proportion falling at higher ages (Ministère de l’Économie, des Finances et du Plan du Sénégal and DHS, 1988). These proportions were appreciably lower in 1992-1993, being 3.5% and 4.7% respectively at the same ages (Ministère de l’Économie, des Finances et du Plan du Sénégal and DHS, 1994). These figures completely obscure the extent of the phenomenon. Thus according to a survey conducted in Dakar in 2001, at ages 40-44, around 4% of the men reported being divorced [9], whereas 22% had already experienced a divorce by age 40 [10] (Antoine and Fall, 2002). Our life history data indicate that one in three unions in Dakar end in divorce (Antoine and Dial, 2003).
Divorce jeopardizes the alliances formed between families through marriage, and for this reason is seen as a source of social disruption not to be talked about (Locoh and Thiriat, 1995). Families take an active part in both marriage and divorce, and the individuals sometimes have no control over decisions. Family pressures to prevent divorce are many. Often, the kinship group perceives divorce as a failure; it usually represents a break between the two partners’ families, who may try to prevent it occurring. On the other hand, among the reasons that led them to separate from their partner, women frequently cite interference by in-laws in the life of the couple.
Before the introduction of the Code de la Famille (Family Code) in 1972, legal separation in Senegal was available only to the man, who could repudiate his wife by statement in the presence of two adult witnesses. The 1972 law attempted to redress the imbalance created by repudiation — a unilateral act for which the initiative lay solely with the man — by allowing the woman to petition for a divorce in the courts. Despite this progress, in the great majority of cases, divorce still occurs outside of the legislative provisions: fewer than 20% of divorces [11] give rise to judicial proceedings. Civil marriage is rare, and religious marriage is what counts most of all [12]. Traditionally, a woman can request a divorce [13] from her husband (nâan baat); this form of divorce is called tagoo in Wolof and is quite distinct from repudiation (fase) (Diop, 1985).
However, some women do seem to be aware of their new rights and an increase is observed in divorce instigated by women. According to our survey, they are at the origin of 80% of divorces, a phenomenon noted earlier by Diop (1985). The rise in divorce instigated by women appears to reflect a social change. Proscribed both by Islam and by society, divorce has nevertheless become a common and unremarkable phenomenon (Dial, 2001). Divorce is frequent but also relatively early: a large proportion of divorces occur during the first five years of marriage (Antoine and Dial, 2003). The overriding importance of marriage for women in Dakar may encourage decisions that are sometimes hasty. The phenomenon is all the more poorly understood for being poorly measured.
a. The event history data
The analysis is based on data from a life history survey conducted in Dakar in 2001 [14]. Three cohorts are considered, consisting of persons aged respectively 25-34, 35-44 and 45-59 at the time of the survey [15]. The survey in Dakar produced a total of 1,290 life histories of men and women, which record the lives of the individuals up to the time of the survey. Thus we know the characteristics of each individual as regards occupation, marital status, number of children, etc. over the whole life course.
In the particular example treated here, we examined the marital histories of men aged between 25 and 44 at the time of the survey [16]. The analysis covers the first unions of 137 men who married in Dakar. Not all of these unions had ended in divorce at the time of the survey (23 couples had already separated); those still intact remain exposed to the risk of divorce [17]. It might be feared that both persons and events were too few in number for an analysis to be feasible. We would certainly not have undertaken it had the TCA method not been available. It can be noted that our data file contains as many lines as there are episodes (here numbering 546); by “episode” is meant each status change [18] experienced by an individual since the start of his union. The last column of Table 1 gives the distribution of the individual statistics (or of the man-months in the case of status change over time [19]) by the different categories of the variables included in the analysis.

Table 1
Estimation of the effects of the potential determinants of divorce among men in Dakar (results of the Cox regression)
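Group	Variable	Modality	Name of variable	Relative risk ratio(a)	Distribution % (man-months)
Cultural factors	Ethnic group	Wolof	wolof	Ref.	43
Cultural factors	Ethnic group	Poular	alpoular	1.38	28
Cultural factors	Ethnic group	Serer	serer	0.25	14
Cultural factors	Ethnic group	Diola	diola	1.72	2
Cultural factors	Ethnic group	Other	aut_eth	0.01**	13
Cultural factors	Religion	Other Muslim	aut_musu	Ref.	25
Cultural factors	Religion	Mouride	mouride	36.55*	47
Cultural factors	Religion	Tidiane	tidiane	3.94	25
Cultural factors	Religion	Christian	chrétien	13.77	3
Cultural factors	Place of birth	Dakar	ln_dakar	Ref.	66
Cultural factors	Place of birth	Rural area	ln_rural	1.14	19
Cultural factors	Place of birth	Urban area	ln_urbain	4.48	14
Cultural factors	Place of socialization	Dakar	se_dakar	Ref.	63
Cultural factors	Place of socialization	Rural area	se_rural	1.85	20
Cultural factors	Place of socialization	Urban area	se_urbain	0.01	17
Educational factors	Education mother	No schooling	mnonsc	Ref.	92
Educational factors	Education mother	Primary	mprim	0.18	7
Educational factors	Education mother	Secondary or above	msecp	7.04	1
Educational factors	Education father	No schooling	pnonsc	Ref.	71
Educational factors	Education father	Primary	pprim	0.99	14
Educational factors	Education father	Secondary or above	psecp	4.42	15
Educational factors	Education of individual	No schooling	nonscol	Ref.	31
Educational factors	Education of individual	Primary	primaire	0.63	33
Educational factors	Education of individual	Secondary or above	second_p	2.11	36
Economic factors	Separate housing	Yes	logauto	Ref.	55
Economic factors	Separate housing	Never	jamloau	23.86**	45
Economic factors	Activity of individual	Informal sector	informel	Ref.	26
Economic factors	Activity of individual	Manager, self-employed	patron	2.93	17
Economic factors	Activity of individual	Employee	salarie	2.46	47
Economic factors	Activity of individual	Apprentice, trainee	app_elev	0.66	7
Economic factors	Activity of individual	Unemployed	chomeur	1.81	3
Economic factors	Activity of wife	Inactive	cfnonact	Ref.	67
Economic factors	Activity of wife	Employee	cfemploy	9.56	2
Economic factors	Activity of wife	Trade	cfvente	0.16	19
Economic factors	Activity of wife	Servant	cfdomest	0.90	13
Demographic and marital factors	Children	No children	pasenf	Ref.	26
Demographic and marital factors	Children	1	enf1	2.11	25
Demographic and marital factors	Children	2	enf2	0.23	21
Demographic and marital factors	Children	3 or more	enf3p	0.36	29
Demographic and marital factors	Age at marriage	Under 21	am_av20	Ref.	5
Demographic and marital factors	Age at marriage	21-25	am21a25	0.29	37
Demographic and marital factors	Age at marriage	26-29	am26a29	0.25	33
Demographic and marital factors	Age at marriage	30-34	am30a34	2.92	19
Demographic and marital factors	Age at marriage	35 or more	am_ap35	3.91	7
Demographic and marital factors	Choice of partner	By family	chxpar	Ref.	12
Demographic and marital factors	Choice of partner	By individual	chxmoim	2.20	88
Demographic and marital factors	Related to partner	Not related	nonpart	Ref.	54
Demographic and marital factors	Related to partner	Related via father	parpat	0.82	25
Demographic and marital factors	Related to partner	Related via mother	parmat	0.13*	21
Demographic and marital factors	Divorce experience of partner	Never previously divorced	cj1nodiv	Ref.	97
Demographic and marital factors	Divorce experience of partner	Previously divorced	cj1exdiv	0.05	3
Demographic and marital factors	Union type	Monogamous	monogame	Ref.	94
Demographic and marital factors	Union type	Polygynous	polygame	5.74	6
(a) In a Cox model, the risk of divorce is modelled by h(t) = h0(t) exp (Xb); the relative risk ratio is 1 for the reference category and exp(bi) for any given modality xi.
** Indicates that the sign of the effect is significant at the 1% level; * at the 5% level.
Coverage: The figures are for married men aged 25-44 at the time of the survey.
Source: Life History Survey (Enquête biographique) IFAN-IRD (2001).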

b. Hypotheses and conceptualization
There are various questions concerning divorce. One might, for example, want to know if, for men, union with a second woman is a way of obtaining the departure of the first without having to raise the matter of divorce with her. The taking of a second wife is not legitimate grounds for divorce, but it does seem that in urban areas the first wives of polygynous men divorce more often than the second wives. One of the most common causes of divorce [20] remains the failure of the husband to provide material support for his wife (Diop, 1985; Dial, 2001). The economic crisis is making it increasingly hard for men to provide adequately for the households for which they are responsible.
Examining the question of divorce means violating the intimacy of the couple and evoking a distressing event. Most respondents are reluctant to talk about a situation that is still poorly accepted by society. But even if some aspects of the divorce experience remain unmentionable, we can try to go beyond individual cases to identify certain structural factors and to reveal the components that weigh for and against divorce in this cohort. To do this we used a Cox model [21] that integrates the factors influencing the length of time between the start of the union (i.e. when it is celebrated at the mosque) and a possible separation [22].
For this analysis, the characteristics of the respondents constitute the primary source of data. This is because it is difficult to obtain precise information on wives, all the more so when they no longer live with the respondents. Several factors characterizing the man are included (see Figure 1). They are classified into four thematic categories: cultural factors, educational level, economic factors, and demographic and marital factors.
The cultural factors — factors related to the person’s social environment — cover ethnic group [23], religion (taking into account the different Muslim brotherhoods [24]), place of birth, and place of socialization (i.e. where the person spent most of his childhood). Various questions concern this group of variables. Do differences in attitude towards divorce according to religion [25] lead to different propensities to divorce? Place of birth and place of socialization are markers for the environment in which a person spent his early years: does the behaviour of young people socialized in Dakar differ from that of people from country areas who moved to the city later?
The second group of factors concerns those linked to the education received from the parents or the school. Education depends on the social origins of the individuals, and as proxy we take the educational level achieved by each of the parents. We also use the educational level of the individual, which is a marker for a degree of independence relative to tradition.
The third group comprises the variables that characterize the socio-economic situation, namely the occupation of the man, the type of work of the first wife at the time of marriage and the situation of dependence in respect of housing.
Lastly, we select demographic variables describing the number of children produced by the union and the characteristics of the union, such as age at marriage, choice of partner, existence of a kinship tie with the partner, and previous divorce experience of the partner. These different factors may or may not influence divorce. Are low age at marriage and the absence of children factors favouring divorce? Is the stability of the union greater when there is a kinship tie between the partners? The status of the union can also change over time: the husband may take a second wife and become polygynous. This marital status change is included in the model; thanks to the union history we can know the date of arrival of a new wife and thus observe the transition from monogamy to polygyny.
2. Statistical analysis
The analysis was performed with the STATA software, and uses the TCA1 program developed by Xavier Bry. The stages are described below.
a. Estimating latent variable W
We perform the usual Cox regression using all the possible explanatory variables. The results are reported in Table 1. Latent variable W is estimated by the Xb supplied by this regression.
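To make the link between Table 1 and the estimate of W concrete, the sketch below rebuilds a linear predictor from relative risk ratios. It is a partial illustration only: it uses three modalities from Table 1, treats every other characteristic as being at its reference category, and the profile and function name are hypothetical.

```python
import math

# Relative risk ratios copied from Table 1 for three modalities.
relative_risk = {"mouride": 36.55, "jamloau": 23.86, "parmat": 0.13}

# A Cox coefficient b is the natural logarithm of the relative risk ratio.
coefficients = {name: math.log(rr) for name, rr in relative_risk.items()}

def linear_predictor(indicators, coefs):
    """Return W = x'b for a dict of 0/1 indicators (absent modalities count as 0)."""
    return sum(coefs[name] * indicators.get(name, 0) for name in coefs)

# Hypothetical profile: Mouride, never in separate housing,
# not related to his wife via the mother.
profile = {"mouride": 1, "jamloau": 1, "parmat": 0}
w = linear_predictor(profile, coefficients)
risk_multiplier = math.exp(w)  # risk relative to the all-reference profile
```

Because the effects add up on the log scale, the multiplier for a profile is the product of the ratios of its active modalities.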
Very few explanatory variables have a statistically significant effect on the risk. If we retain only the interpretable modalities, only three [26] have a clear influence on the risk of early divorce. These are Mouride religion, lack of separate housing, and being related to the partner on the mother’s side. Marriage is perhaps less stable among the Mourides than among other Muslims. In some cases it is the marabout who celebrates the union, sometimes without any real discussion with the couple concerned.
The main factor favouring divorce is of an economic kind. The risks of divorce are appreciably higher when the newly formed family unit does not obtain a separate dwelling and continues to rely on family members for its accommodation. A husband’s inability to provide independent housing for his couple is associated with earlier divorce. This inability can be linked back to the lack of material support that is often cited among the causes of divorce. The protracted economic crisis means that young couples today go to live in the husband’s family home, and thus have to face the problems associated with cohabitation on a daily basis. Relations between the woman and her in-laws are as a rule extremely complicated. These couples are more exposed than others to divorce.
The existence of a kinship tie between the partners characterizes unions to which the family broadly defined is heavily committed, unions therefore that are a priori more stable. Unions with a common relative on the mother’s side tend to be subject to greater vigilance.
Our model includes many variables, several of which are redundant. Because of the multicollinearity that results from this, we cannot tell at this stage if, in addition to the three factors identified, other variables also have an important explanatory role, whose effect would be masked by the multicollinearity [27].
A tool is needed that will detect and take into account any such multicollinearity, and select the most useful variables for modelling. We use successively three methods. The first (PCA regression) is not based on any conceptual model for determining the factor planes; the second (PLS regression) includes the existence of an explanatory schema, but without distinguishing the themes; the third (TCA) integrates the totality of the explanatory schema of divorce, by including the four themes presented earlier.
b. Regression of W = Xb on the principal components of X
The PCA of X estimates the latent explanatory variables without using a thematic model. The first two factors account for 16.79% of the total variance; the first ten factors account for 53%, and the first twenty are needed to capture 79.6%. The eigenvalues decline very slowly, which is indicative of an unstructured scatter (no major correlation clusters). Because the first two eigenvalues are very close (8.63% and 8.06% of total variance), it is necessary to interpret the factor plane (1,2) as a whole rather than the factors in isolation. The first two principal components of X give a plane that shows the importance of the places of birth and of socialization.
The regression of Xb on the factors gives the following results:
On the first two factors, the coefficient of determination R2 equals 0.007; on the first ten factors, R2 = 0.512.
The factors with the greatest explanatory power are, in decreasing order, the ninth, twentieth, and fifth. These are high rank factors, and hence structurally weak, poorly correlated with the observed variables.
The results from this method prove virtually unusable.
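Why a structurally weak component can carry most of the explanatory power can be seen on a toy case. Principal components are mutually uncorrelated, so the R2 of a regression on them is the sum of the squared correlations between the response and each component, whatever their variances. The data below are invented for illustration.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Two uncorrelated, centred components: f1 has a large variance, f2 a small one.
f1 = [-3.0, -3.0, 3.0, 3.0]
f2 = [-0.5, 0.5, -0.5, 0.5]
# Hypothetical response driven almost entirely by the weak component.
y = [0.1 * a + 2.0 * b for a, b in zip(f1, f2)]

r1, r2 = correlation(y, f1), correlation(y, f2)
r_squared_by_component = [r1 ** 2, r2 ** 2]
r_squared_total = sum(r_squared_by_component)
```

Here the low-variance component accounts for most of the R2, which mirrors the situation described above: ranking components by variance says nothing about their relevance to W.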
c. PLS regression of W = Xb on the variables of X
Unlike PCA, PLS regression has an explanatory orientation. We note straight away that this orientation considerably improves the predictive power of the dimensions extracted. Regression of Xb on the PLS factors yields the following results:
On the first two factors, the coefficient of determination R2 equals 0.945; on the first four factors, R2 is 0.987.
On the initial plots we saw that the axes were weakly correlated with the observed variables. The mixing up of themes produces an overall result that is hard to interpret. However, on the edges of the scatter a number of variables emerge, such as: Mouride religion, independent housing, other ethnic group, transition to polygyny, educational level of ego and of his parents.
Because it has no ordering by theme, this method produces a result that is again disappointing. True, Xb is well represented on the first factors, but these are too weakly correlated with the observed explanatory variables.
d. Thematic components analysis
We work with the four explanatory themes presented earlier:
  • X1 = cultural factors;
  • X2 = educational factors;
  • X3 = economic factors;
  • X4 = demographic and marital factors.
Calculation of the factors
We calculate two factors per thematic group. Factor j of group i is called XiFj. The factors are standardized. By regressing W = Xb on all these factors, W can be approximated by a linear combination of these factors, weighted by coefficients. The results of the regression are presented in Table 2 (R2 = 0.96).

Table 2
Regression coefficients of W on all the factors from the TCA
Theme	Factor	Coefficient
Cultural	X1F1	2.60
Cultural	X1F2	0.85
Educational level	X2F1	1.02
Educational level	X2F2	0.43
Economic	X3F1	1.62
Economic	X3F2	0.70
Demographic and marital	X4F1	1.52
Demographic and marital	X4F2	0.73

The rank 2 factors have coefficients that are systematically and appreciably lower than their rank 1 counterparts (between roughly one third and one half, according to Table 2), so we perform the regression only on the latter. The results are given in Table 3 (R2 = 0.86).

Table 3
Regression coefficients of W on the rank 1 factors from the TCA
Theme	Factor	Coefficient
Cultural	X1F1	2.68
Educational level	X2F1	0.99
Economic	X3F1	1.87
Demographic and marital	X4F1	1.40

We thus manage to capture 86% of the variance of W on the four rank 1 factors. This performance is not as good as that obtained with PLS regression, though this is unsurprising, since the factors of the TCA are constrained by theme. But this constraint, which eliminates the mixing up of themes, can be expected to supply factors that have a clearer interpretation.
Note that the factors with the lowest explanatory power are those of groups 2 (educational level) and 4 (demographic and marital).
Examination of the groups in the thematic planes
Factors 1 and 2 of each group give the planes in Figures 9 to 12 [28].
Figure 9
First factor plane of group 1 (cultural factors)
The variables that clearly illustrate this plane are: Mouride religion, for axis 1; place of birth and place of socialization for the plane as a whole (triangular configuration).
The urban gradient (rural–provincial urban centre–capital) is reproduced by factor 2 and not by factor 1. Now this second factor has a lower explanatory power than the first.
Figure 10
First factor plane of group 2 (educational factors)
The first factor reproduces the hierarchy of educational levels (no schooling, primary, secondary or higher), both for ego and his parents. The second—with considerably less explanatory power— differentiates persons with no schooling from persons with primary schooling. We note the high degree of social reproduction: ego is very likely to have the same educational level as his mother and father.
Figure 11
First factor plane of group 3 (economic factors)
Axis 1 highlights separate housing, a particularly strong factor of divorce that certainly reflects the level of the husband’s income. Axis 2, secondary, makes salient three occupations of the wife (servant, employee, trade) but has a low correlation with these three categories.
Figure 12
First factor plane of group 4 (demographic and marital factors)
The first thematic plane is weakly correlated with the variables of the group.
This group has no strong structures with a high explanatory power.
Selecting the predictors
The overall explanatory powers of the groups are easily measured by the regression coefficients of their factors. Distinguishing different themes makes the role of the groups much clearer. The thematic planes are well illustrated (except for group 4 that concerns demographic and family factors) and thus clearly interpretable.
The rank 2 factors of the groups with low predictive power must be eliminated when these factors have no clear interpretation (X2F2 and X4F2). Factor X1F2 is also unclear and has a low predictive power, but it introduces the modalities of variables present on factor X1F1 (places of birth and of socialization). Initially, therefore, it can be retained. Factor X3F2 is weakly predictive, but correlated exclusively with the modalities of the wife’s occupation. This factor is thus retained provisionally.
Regarding the rank 1 factors, they are all retained, though with limited expectations about those without high explanatory power and/or whose interpretation is uncertain because they are poorly illustrated or influenced by too many variables. The latter do not clearly indicate the small number of modalities to be included in a parsimonious and efficient model.
Here, we retain the following factors (with their related modalities):
  • X1F1: combines religion (mouride, aut_musu) and urban origins (ln_urbain/se_urbain);
  • X1F2: contrasts Dakar origins (ln_dakar/se_dakar) with rural origins (ln_rural/se_rural);
  • X2F1: all the modalities of educational level (see Table 4);
  • X3F1: residing or not in separate housing (jamloau, logauto);
  • X3F2: occupation of the wife (cfemploy, cfdomest, cfvente);
  • X4F1: family characteristics (number of children, age at marriage, nonpart).
For X2F1 (first factor of group 2), we saw that it reproduces the hierarchy of the levels of education, making a balanced use of every modality of educational level. It is therefore useful to synthesize it by performing a PCA on these modalities, so it can be used in the final model as an exogenous latent variable. In this way we obtain the “niscola” variable (named from the French “NIveau SCOLAire”, or educational level), a linear combination of the variables for educational level to which the coefficients presented in Table 4 are applied, that gives a ranking of the educational “capital” of the individual.

Table 4
Variable weightings on educational level measured by NISCOLA
Variable	Name of variable	Coefficient on niscola
Mother no schooling	mnonsc	– 1.64
Father no schooling	pnonsc	– 0.93
Ego no schooling	nonscol	– 0.72
Mother primary level	mprim	1.33
Father primary level	pprim	0.36
Ego primary level	primaire	– 0.07
Mother secondary level or above	msecp	2.12
Father secondary level or above	psecp	1.16
Ego secondary level or above	second_p	0.66
Constant		1.74
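As an illustration of how such a synthetic variable is computed, the sketch below applies the Table 4 weights to an individual's modalities. It assumes that the modalities enter as raw 0/1 indicators and that the constant is simply added; the source does not spell out these details, and the function name is ours.

```python
# Coefficients copied from Table 4.
NISCOLA_CONSTANT = 1.74
NISCOLA_WEIGHTS = {
    "mnonsc": -1.64, "pnonsc": -0.93, "nonscol": -0.72,
    "mprim": 1.33, "pprim": 0.36, "primaire": -0.07,
    "msecp": 2.12, "psecp": 1.16, "second_p": 0.66,
}

def niscola(active_modalities):
    """Score for the modalities coded 1 (one each for mother, father and ego)."""
    return NISCOLA_CONSTANT + sum(NISCOLA_WEIGHTS[m] for m in active_modalities)

# Two extreme hypothetical profiles.
lowest = niscola({"mnonsc", "pnonsc", "nonscol"})    # nobody went to school
highest = niscola({"msecp", "psecp", "second_p"})    # all secondary or above
```

Under these assumptions the score ranges from about -1.55 to about 5.68, with the mother's level weighing most heavily.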

The same procedure can be followed with the places of birth and socialization to estimate a latent variable of rural versus urban birth and upbringing. For this variable (originally named “ruralité”, literally “ruralness”) we obtain a set of coefficients given in Table 5.

Table 5
Weightings of places on the ruralité variable
Variable	Name of variable	Coefficient on ruralité
Born in Dakar	ln_dakar	– 1.10
Socialized in Dakar	se_dakar	– 1.07
Born in an urban area	ln_urbain	0.69
Socialized in an urban area	se_urbain	0.58
Born in a rural area	ln_rural	1.00
Socialized in a rural area	se_rural	0.99
Constant		0.73

Concerning the number of children and age at marriage, it is worth reconverting these variables into numerical, or at least ordinal, variables, since the axis X4F1 comes close to reproducing their gradient. This will allow more precise estimation of the possible effect. As these two variables appear to be inter-correlated, it is likely that they cannot coexist in the same model and that one will thus have to be eliminated. We retain the one whose causal role is the most interpretable, or, failing this, the one that gives the best fit.
e. The final Cox model
We first introduce all the predictors selected above into the econometric model. We then progressively eliminate those with no convincing effect. This sorting process is much easier than starting with all the available variables. Table 6 recapitulates the variables finally selected, i.e. those with an effect significant at the 5% level.

Table 6
Effects of the determinants of divorce selected after the TCA (results from the Cox regression for the final model)
Variable	Name of variable	Relative risk ratio(a)	Confidence interval (95%)
Mouride religion	mouride	8.53**	[2.87; 25.29]
Never separate housing	jamloau	4.73**	[1.77; 12.64]
Not related to wife	nonpart	2.80*	[0.95; 8.20]
Activity of wife: office worker	cfemploy	4.87*	[0.77; 30.90]
Educational level (continuous variable)	niscola	1.30*	[1.00; 1.70]
Age at marriage (continuous variable)	agordmar	1.52*	[0.94; 2.46]
(a) In a Cox model, the risk of divorce is modelled by h(t) = h0(t) exp (Xb); the relative risk ratio is 1 for the reference category and exp(bi) for any given modality xi.
** Indicates that the sign of the effect is significant at the 1% level; * at the 5% level.
Reading: an increase of one year in the age at marriage corresponds to a 52% increase in the risk of divorce.
Coverage: The figures are for married men aged 25-44 at the time of the survey.
Source: Life History Survey (Enquête biographique) IFAN-IRD (2001).
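The intervals in Table 6 can be read back on the coefficient scale. The sketch below assumes they are Wald-type intervals, symmetric around b on the log scale with a critical value of 1.96. That is our assumption, not something stated in the table, and the function name is hypothetical.

```python
import math

# (relative risk ratio, lower bound, upper bound) copied from Table 6.
TABLE_6 = {
    "mouride": (8.53, 2.87, 25.29),
    "jamloau": (4.73, 1.77, 12.64),
    "nonpart": (2.80, 0.95, 8.20),
    "cfemploy": (4.87, 0.77, 30.90),
    "niscola": (1.30, 1.00, 1.70),
    "agordmar": (1.52, 0.94, 2.46),
}

def log_scale_summary(rr, lower, upper, z_crit=1.96):
    """Return (coefficient, standard error, z statistic) implied by the interval."""
    b = math.log(rr)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    return b, se, b / se

summaries = {name: log_scale_summary(*row) for name, row in TABLE_6.items()}

# If the intervals are symmetric on the log scale, the geometric mean of the
# bounds should reproduce the point estimate up to rounding.
geometric_midpoints = {name: math.sqrt(lo * hi)
                       for name, (_, lo, hi) in TABLE_6.items()}
```

The geometric midpoints agree with the published ratios to within rounding, which is consistent with the assumption. Multiplying ratios gives the relative risk of a combined profile, for example 8.53 × 4.73 for a Mouride man who never had separate housing, other variables being equal.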

This model is more instructive than the initial model (Table 1). In addition to the three factors already identified (Mouride religion, lack of separate housing, and being related to the partner) the factors favouring divorce now include the wife being in paid employment, having been to school, and later marriage. Paid employment by the woman seems to increase the risk of divorce, with women who acquire a certain economic independence divorcing at a faster tempo than those in situations of economic insecurity. The effect of being related to the partner is more easily interpretable than in Table 1, with simply a contrast between the existence or not of a kinship tie, its absence increasing the probability of divorce.
Moreover, some of our hypotheses are not confirmed. The number of children (and in particular having no children) seems not to be a factor of divorce, contradicting the hypothesis that infertility of the woman might be a factor favouring divorce. Nor does the model indicate a role for polygyny [29]. It is true that we are studying the phenomenon from observation of men who are still young and for the most part recently married, few of whom are concerned by this practice.
 
Conclusion: seeing everything, retaining what is essential
 
 
The methodology presented here occupies an intermediate position between an exploratory and a confirmatory mode of analysis. If, like the latter, it requires the specification of a conceptual explanatory schema, this can remain very general and only loosely influence the measurement procedure.
The fact of having to specify a conceptual model leads the analyst from empirical description to explanation, as is not really the case with more classic data analysis methods.
The fact that the multiple measurements pertaining to a concept are retained almost to the end of the analysis offers two advantages. First, they can all express themselves in the framework of the explanatory model (despite the redundancy between them), thus allowing selection of the best among them. Second, the possible redundancy between various measurements relating to the same concept means that this can be represented more robustly by means of synthetic factors. Measurement of the concepts is in this way freed of noise, and the corresponding statistical effect in the estimated model is made more reliable. The example of educational level is particularly eloquent in this respect (cf. Table 4).
In conclusion, the analytical approach proposed has enabled us to avoid lengthy trial-and-error selection of the explanatory variables during construction of the statistical model. The initial inclusion of all the potential variables protects against omitting an important dimension. In addition, the method also makes it easy to eliminate redundancy. Using it we have produced a model that has proved richer and more reliable (Table 6) than the initial model (Table 1) for studying a relatively rare event from a sample that is small in size for this type of analysis.
 
REFERENCES
 
·  Antoine Philippe, 2002, “Les complexités de la nuptialité : de la précocité des unions féminines à la polygamie masculine en Afrique”, in G. Caselli, J. Vallin, G. Wunsch (eds.), Démographie : analyse et synthèses. Vol. II–Les déterminants de la fécondité, Paris, Ined (coll. Manuels), pp. 75-102.
·  Antoine Philippe, Djiré Mamadou, Nanitelamio Jeanne, 1998, “Au cœur des relations hommes-femmes : polygamie et divorce”, in P. Antoine, D. Ouédraogo, V. Piché (eds.), Trois générations de citadins au Sahel, Paris, L’Harmattan, pp. 147-180.
·  Antoine Philippe, Abdou Salam Fall (dir.), 2002, Crise, passage à l’âge adulte et devenir de la famille dans les classes moyennes et pauvres à Dakar, Interim report for CODESRIA, IRD-IFAN, Dakar, March 2002, 118 p. + 22 p. appendices.
·  Antoine Philippe, Dial Fatou Binetou, 2003, “Mariage, divorce et remariage à Dakar et Lomé”, Journées scientifiques de l’AUF, Familles du Nord, Familles du Sud, Marseille 23-26 June 2003, 22 p. (forthcoming).
·  Bry Xavier, 1994, Analyses Factorielles Simples, Economica Poche, 112 p.
·  Bry Xavier, 2001a, “Analyses Discriminantes Régularisées via la régression PLS et l’Analyse en Résultantes Covariantes”, MODULAD, No. 28, pp. 27-61.
·  Bry Xavier, 2001b, “Une autre approche de l’analyse factorielle : l’Analyse en Résultantes Covariantes”, RSA, 49(3), pp. 5-38.
·  Bry Xavier, 2003, “Une méthode d’estimation empirique d’un modèle à variables latentes : l’Analyse en Composantes Thématiques”, RSA, 51(2), pp. 5-45.
·  Bry Xavier, 2004, “Estimation empirique d’un modèle à variables latentes comportant des interactions”, RSA, 52(3) (forthcoming).
·  Burnham Philip, 1987, “Changing themes in the analysis of African marriage”, in D. Parkin, D. Nyamwaya (eds.), Transformations of African Marriage, Manchester, Manchester University Press (International African Seminars, New Series, No. 3), pp. 37-54.
·  Cazes Pierre, 1997, “Adaptation de la régression PLS au cas de la régression après analyse des correspondances multiples”, RSA, XLV(2), pp. 89-99.
·  De Jong Sijmen, 1995, “PLS shrinks”, Journal of Chemometrics, Vol. 9, pp. 323-326.
·  Dial Fatou Binetou, 2001, “Le divorce, source de promotion pour la femme ? L’exemple des femmes divorcées de Dakar et de Saint-Louis (Sénégal)”, in T. Locoh, K. Nguessan, P. Makinwa-Adebusoye (eds.), Systèmes de genre et questions de population en Afrique. Résistances et innovations, Dakar, UEPA/Paris, INED, 15 p. (forthcoming).
·  Diop Abdoulaye Bara, 1985, La famille wolof : tradition et changement, Paris, Karthala, 262 p.
·  Gould Stephen J., 1981, The Mismeasure of Man, New York, Norton.
·  Hertrich Véronique, Locoh Thérèse, 1999, Rapports de genre, formation et dissolution des unions dans les pays en développement, Liège, UIESP (Gender in population series), 46 p.
·  Kaufman Georgia, Lesthaeghe Ron, Meekers Dominique, 1988, “Les caractéristiques et tendances du mariage”, in D. Tabutin (ed.), Population et sociétés en Afrique au sud du Sahara, pp. 217-248.
·  Lebart Ludovic, Morineau Alain, Piron Marie, 1995, Statistique exploratoire multidimensionnelle, Dunod.
·  Locoh Thérèse, Thiriat Marie-Paule, 1995, “Divorce et remariage des femmes en Afrique de l’Ouest. Le cas du Togo”, Population, 50(1), pp. 61-94.
·  McDonald Peter, 1985, “Social organisation and nuptiality in developing countries”, in J. Cleland, J. Hobcraft (eds.), Reproductive Change in Developing Countries, Oxford, Oxford University Press, pp. 87-114.
·  Ministère de l’Économie, des Finances et du Plan (Direction de la prévision et de la statistique), 1988, Enquête démographique et de santé au Sénégal 1986, Dakar, DHS/Macro International, 173 p.
·  Ministère de l’Économie, des Finances et du Plan (Direction de la prévision et de la statistique), 1994, Enquête démographique et de santé au Sénégal 1992-93 (EDS II). Dakar; Calverton, DHS/Macro International, 284 p.
·  Smith David P., Carrasco Enrique, McDonald Peter, 1984, Marriage Dissolution and Remarriage, Voorburg, International Statistical Institute (World Fertility Survey Comparative Studies, No. 34), 94 p.
·  Tenenhaus Michel, 1998, La régression PLS, théorie et pratique, Technip.
·  Tenenhaus Michel, 1999, “L’approche PLS”, RSA, 47(2), pp. 5-40.
·  Wold Hermann, 1985, “Partial least squares”, Encyclopedia of Statistical Sciences, John Wiley & Sons, pp. 581-591.
 
NOTES
 
[*]LISE-CEREMADE, Université Paris IX-Dauphine.
[**]IRD, Équipe JÉREMI, UR DIAL-CIPRE. Translated by Godfrey I. Rogers.
[1]This is not true, however, of spectral analyses, such as harmonic analysis.
[2]For example, Cox’s regression models an instantaneous risk (that of an event occurring in the near future) as a function of the individual’s acquired characteristics (characteristics that can of course include all aspects of the individual’s past and can change over time).
[3]Use of econometric modelling, based on establishing the value or even simply the sign of the parameter estimates, requires a minimum of stability in these estimates.
[4]Some factor analytic techniques, such as canonical correlation analysis (CCA) and discriminant factor analysis (DFA), generalize multiple regression (Bry, 2001a). For this reason one might be tempted to classify them with the “explanatory” methods. In our view this would be a mistake. Canonical correlation analysis reestablishes total symmetry between the two sets of variables involved, so its use is naturally exploratory. As for so-called “discriminant analysis”, it has the reputation of “explaining” a categorical variable from a set of predictor variables. In reality, the term “discriminant analysis” covers a number of procedures, some of which deserve this reputation while others do not, depending on the conditioning that they use. Logistic regression, for example, uses a conditioning of the categorical variable by the predictors; potentially therefore it can aspire to explain the former by the latter. As for discriminant factor analysis, it is a special case of canonical correlation analysis and a priori uses no conditioning. A practical criterion can be proposed that makes it easy to settle the issue: an authentic “explanatory” method, using a conditioning of the variable to be explained, immediately gives an estimation formula for the latter. This is not the case for either CCA or DFA.
[5]A proxy merely represents, with some degree of error, the associated latent variable.
[6]With the factors no longer estimating a priori latent variables, interpreting each of them in isolation ceases to be indispensable; often indeed it is pointless, since the strong structures of X, if they are not uncorrelated, diverge from these latent variables. On the other hand, the sub-space formed by the p first factor axes contains by definition the principal structures of X. To identify them, we examine the first factor planes, by comparing them with each other, in an attempt to go a little beyond dimension 2. A disregard for multidimensionality and an insistence on interpreting a factor are highly dangerous attitudes. It is relevant to recall here the historical example of the first principal component from Spearman’s psychometric tests, which for thirty years was interpreted as a single factor of “general intelligence” (the famous G factor), until Thurstone showed it to be completely vacuous (as Spearman himself recognized at the end of his life) by revealing the deeply bi-dimensional structure of the tests (verbal and mathematical dimensions), and the fact that G was only weakly correlated with each of these two dimensions. This could be amusing had the G factor not been used as the “scientific basis” that justified the premature exclusion of large numbers of British children from the school system in order to save money (Gould, 1981). As this episode shows, it is tautologically unrealistic to seek to reduce a multidimensional reality to a single dimension.
[7]A model is made up of constraints: the presence or not of some aspect of social reality, its quantification, and the form of the relationship between the different quantified aspects. Depending on the choices made, some phenomena will be revealed directly while others, invisible as such, will appear as “ghosts”, by transferring their effect onto the aspects present in the model. This is exactly where the danger lies, when one is unaware of what is invisible.
[8]Strictly speaking, by their mode of selection, the original explanatory variables selected on the basis of their correlation with these factors are not free of endogeneity, but this is true of all methods for selecting explanatory variables.
[9]From the household questionnaire. This survey was conducted by the Institut Fondamental d’Afrique Noire (IFAN) and by the Institut de Recherche pour le Développement (IRD).
[10]From the life history questionnaire of the IFAN-IRD survey.
[11]According to the results from our survey. Most of the judicial proceedings are instigated by women.
[12]Religious marriage is supposed to be registered later with the civil registration authorities, but this is far from being always the case.
[13]Divorce in the broad sense (judicial or not).
[14]The survey was conducted in Dakar by an IRD-IFAN team (Antoine and Fall, 2002) thanks to funding from CODESRIA (Conseil pour le Développement de la Recherche en Afrique) and the IRD.
[15]Corresponding respectively to the 1967-76, 1957-66, and 1942-56 birth cohorts. These cohorts reached the age of family building in sharply contrasted contexts.
[16]The works published to date from this survey concern mainly women, on whom the information seems to be more reliable. Because they marry considerably earlier than men (an age difference of around 10 years), the analysis of divorce also concerns more cases.
[17]Observation ends if one of the partners dies.
[18]Birth of a child, change of occupation, change of address, etc.
[19]The time-varying variables are occupation, children, and type of union.
[20]The other causes of divorce include a difficult cohabitation with in-laws or between co-wives (Dial, 2001).
[21]For a fuller explanation of event history analysis of nuptiality, see Antoine (2002).
[22]For persons who have not divorced, observation is right-censored at the date of the survey.
[23]The Wolof form the majority in Dakar, and their cultural practices are increasingly being adopted by the other ethnic groups.
[24]The great majority of the population of Dakar are Muslim. Among these Muslims we distinguish the members of the Mouride and Tidian brotherhoods.
[25]A well-known example is the prohibition of divorce for Catholics.
[26]Other ethnic group is a heterogeneous category.
[27]For example, there is a strong probability of a correlation between place of birth and place of socialization. Redundancy of this kind poses no problem at all for TCA.
[28]On each thematic plane we have also projected all the variables of the other themes (their names are in italics), so as to check that there is no strong overlap between the themes. Multicollinearity of that kind between themes would invalidate the thematic model proposed.
[29]On this question see Antoine et al., 1998.