Can we understand income polarization using SiZer?

A guest post by Maria Grazia Pittau:

Rather than revelling in the dolce vita, Italians are battling with the carovita (the high cost of living). Newspaper headlines warn that the “Middle class has gone to hell” and “Italians don’t know how they will make it to the end of the month” (The Guardian, Tuesday, December 28, 2004).

Considerable progress has been made in empirical research on measuring the degree of polarization in the income distribution, a feature that inequality indexes do not properly capture. In general, “given any distribution of income, the term polarization means the extent to which a population is clustered around a small number of distant poles” (Esteban, 2002, p.10). How many poles there are and how distant they are can be regarded as structural features of the whole income distribution. Kernel density estimates are well suited to answering these questions, since features such as location, multi-modality and spread can be observed simultaneously. The choice of the bandwidth parameter h is crucial in kernel density estimation, since it governs the degree of smoothness of the density estimate: depending on the amount of smoothing applied, the estimate models the data in coarser or finer detail.
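To make the role of h concrete, here is a minimal sketch of a fixed-bandwidth Gaussian kernel density estimate, evaluated at three bandwidths on a synthetic two-group sample. The sample, the grid, the bandwidth values and the helper names `gaussian_kde` and `count_modes` are illustrative assumptions, not the Italian data or the code behind the figures.

```python
import math
import random

def gaussian_kde(data, grid, h):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated on a grid."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
            for x in grid]

def count_modes(density):
    """Count strict local maxima of a density evaluated on a grid."""
    return sum(1 for i in range(1, len(density) - 1)
               if density[i - 1] < density[i] > density[i + 1])

random.seed(0)
# Synthetic two-group sample (illustrative only, not real income data).
data = ([random.gauss(7.0, 1.0) for _ in range(150)] +
        [random.gauss(12.0, 1.5) for _ in range(150)])
grid = [0.1 * i for i in range(201)]  # evaluation points from 0 to 20

# Small h: many spurious bumps. Large h: the two groups merge into one mode.
for h in (0.1, 0.8, 5.0):
    print(f"h={h}: {count_modes(gaussian_kde(data, grid, h))} mode(s)")
```

The number of modes found depends entirely on h, which is why a single density plot cannot settle how many poles the distribution has.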

For income data an adaptive bandwidth is suggested, that is, a bandwidth that varies along the support of the data set. This allows one to reduce the variance of the estimates in areas with few observations (generally, in the tails of the distribution) and to reduce the bias of the estimates in areas with many observations (generally, in the middle of the distribution). Analysis based on kernel densities relies to a great extent on visual impression. When the visual impression seems to corroborate the presence of more than one mode in the distribution, further investigation should be devoted to identifying the sub-populations clustered around the modes.
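One common way to obtain such a varying bandwidth is a two-stage scheme: compute a fixed-bandwidth pilot estimate, then widen the kernel where the pilot density is low and narrow it where the pilot density is high. The sketch below assumes this scheme with a square-root rule (`alpha=0.5`); the post does not say which adaptive method was actually used, and the sample, the pilot bandwidth and the function names are hypothetical.

```python
import math
import random

def kde_at(points, x, bandwidths):
    """Gaussian KDE at x, where each data point carries its own bandwidth."""
    c = 1.0 / (len(points) * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - xi) / hi) ** 2) / hi
                   for xi, hi in zip(points, bandwidths))

def adaptive_bandwidths(points, h_pilot, alpha=0.5):
    """Local bandwidths h_i = h_pilot * (pilot(x_i) / g) ** (-alpha).

    g is the geometric mean of the pilot density over the sample, so the
    local bandwidths have geometric mean h_pilot. alpha=0.5 is an assumed
    sensitivity, not a value taken from the post.
    """
    fixed = [h_pilot] * len(points)
    pilot = [kde_at(points, xi, fixed) for xi in points]
    g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))
    return [h_pilot * (p / g) ** (-alpha) for p in pilot]

random.seed(0)
# Right-skewed synthetic sample (illustrative only, not real income data).
data = [random.lognormvariate(0.0, 0.6) for _ in range(300)]
h_pilot = 0.3
bw = adaptive_bandwidths(data, h_pilot)

i_max = data.index(max(data))      # observation deepest in the right tail
i_dense = bw.index(min(bw))        # observation in the densest region
print(f"tail point  x={data[i_max]:.2f}  h={bw[i_max]:.3f}")
print(f"dense point x={data[i_dense]:.2f}  h={bw[i_dense]:.3f}")
```

The observation in the sparse right tail receives a wide kernel (less variance) while observations in the dense middle receive narrow ones (less bias), which is the trade-off described above.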

But are the modes really there? Or are they just spurious artifacts of the data?

To assess which observed features in the income distribution are “really there”, as opposed to being spurious sampling artifacts, we follow the SiZer approach (Chaudhuri and Marron, 1999, 2000). SiZer is a graphical tool that displays significant features with respect to location and bandwidth by assessing the SIgnificant ZERo crossings of the derivatives.

The main advantage of SiZer is that, for a wide range of bandwidths, it looks at how changes in the bandwidth affect the estimate at a particular location of the empirical distribution. It assesses the robustness of the shapes across bandwidths instead of focusing on a single “true” underlying curve.

An important feature is a “bump”, and the role of SiZer is to attach significance to these bumps. When a bump is present there is a zero crossing of the derivative of the density smooth, and the bump is statistically significant (a mode) when the derivative estimate is significantly positive to the left and significantly negative to the right. Analogously, a dip is significant when the derivative estimate is significantly negative to the left and significantly positive to the right.
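The rule for bumps and dips can be written down in a few lines. The sketch below assumes that, at one fixed bandwidth, the derivative has already been classified at each grid location as significantly positive (+1), significantly negative (-1) or not significant (0); the function name and the example row are hypothetical.

```python
def significant_bumps_and_dips(signs):
    """Count significant bumps and dips in one row of derivative signs.

    signs: sequence of +1, -1 or 0, ordered from left to right.
    A bump is a change from +1 to -1 and a dip a change from -1 to +1,
    with any non-significant locations (0) in between ignored.
    """
    nonzero = [s for s in signs if s != 0]
    pairs = list(zip(nonzero, nonzero[1:]))
    bumps = sum(1 for left, right in pairs if left > 0 and right < 0)
    dips = sum(1 for left, right in pairs if left < 0 and right > 0)
    return bumps, dips

# Hypothetical row: increasing, decreasing, increasing again, decreasing.
row = [1, 1, 0, -1, -1, 0, 1, 1, -1]
bumps, dips = significant_bumps_and_dips(row)
print(bumps, dips)
```

A location where the derivative merely crosses zero without significant slopes on both sides, such as the row `[0, 1, 0, 0]`, contributes neither a bump nor a dip.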

The SiZer approach has two graphical components: a family of nonparametric curves indexed by the smoothing parameter (the scale space surface), and the SiZer map, which displays information about the positivity and negativity of the derivative of the kernel estimator. Each point in the map is indexed by the location on the horizontal axis (x) and by the bandwidth on the vertical axis (h). For a resolution level h, the estimated derivative of the kernel estimate is significantly positive (negative) when all the points within a given confidence interval are positive (negative), that is, when the (Gaussian) kernel density estimate is significantly increasing (decreasing) at that location.
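A simplified version of the map can be sketched as follows. The derivative of the Gaussian kernel estimate at (x, h) is an average of per-observation terms, so its standard error can be estimated from their sample variance. This sketch uses a pointwise normal interval with z = 1.96, which is an assumption made for brevity: Chaudhuri and Marron adjust the quantile for simultaneous inference and treat data-sparse regions separately, and neither refinement is reproduced here. The data, the grids and the function name are illustrative.

```python
import math
import random

def derivative_sign(data, x, h, z=1.96):
    """Sign of the KDE derivative at (x, h) if significant, else 0.

    Uses a pointwise interval est +/- z * se (a simplifying assumption).
    """
    n = len(data)
    c = 1.0 / (h * h * math.sqrt(2.0 * math.pi))
    terms = []
    for xi in data:
        u = (x - xi) / h
        terms.append(-c * u * math.exp(-0.5 * u * u))  # d/dx of K_h(x - xi)
    est = sum(terms) / n
    se = math.sqrt(sum((t - est) ** 2 for t in terms) / (n - 1) / n)
    if est - z * se > 0:
        return 1
    if est + z * se < 0:
        return -1
    return 0

random.seed(1)
# Synthetic two-group sample (illustrative only, not real income data).
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(6.0, 1.0) for _ in range(200)])
grid = [-3.0 + 0.5 * i for i in range(25)]                # x from -3 to 9
bandwidths = [10 ** (-1.0 + 0.25 * k) for k in range(9)]  # log10(h) in [-1, 1]

# Text rendering of the map: largest bandwidth on top, as in a SiZer plot.
symbols = {1: "+", -1: "-", 0: "."}
for h in reversed(bandwidths):
    row = "".join(symbols[derivative_sign(data, x, h)] for x in grid)
    print(f"log10(h)={math.log10(h):+.2f}  {row}")
```

Reading a row from left to right, a run of `+` followed by a run of `-` marks a mode that is significant at that bandwidth; reading a column from bottom to top shows how the verdict at one location changes with the level of smoothing.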

mg-Image2.png
Figure 1

Figure 1 represents the scale space surface, that is, an overlaid family of kernel density estimates, each corresponding to a different bandwidth. The family plot conveys the idea that no single bandwidth can capture all the information available in the data. The corresponding SiZer map sheds more light on the crucial question of which modes are statistically significant at any given level of resolution.

mg-Image1.png
Figure 2

Figure 2 reports the net annual disposable income in Italy in 2002 of all the members of the household, after tax and social security transfers, for different bandwidths. Household incomes are adjusted for differences in household size and are reported in 1995 prices using the consumption deflator of the national accounts.

In the SiZer map in Figure 2, the horizontal axis, y, represents the household equivalent income, and the vertical axis, log10(h), represents the bandwidth. The log10 scale is used for the bandwidth so that the displayed smooths are more equally spaced. The horizontal black line represents the optimal (pilot) bandwidth.

The regions of the display are colour-coded: one colour, say light gray, where the derivative is significantly positive; a different colour, dark gray, where the derivative is significantly negative. The points at which the derivative is neither significantly positive nor significantly negative appear in the black region of the SiZer map.

The two modes of the Italian income distribution in 2002 are detected for a wide range of bandwidths. These two modes are located around 7,201 euros and 10,254 euros, indicating the presence of two groups of households in Italy: a poor and a rich group. The SiZer approach thus provides a graphical counterpart to the measures of polarization for continuous distributions applied in Duclos et al. (2004). Although the number and location of the modes cannot be directly linked to the degree of polarization, the emergence of multiple modes, their intensity and their separateness may help relate changes in the shape of distributions to changes in the polarization measurements.