Title: Race and ethnicity classification consistency between the Census Bureau and the National Center for Health Statistics
SuDoc Number: C 3.223/27:17
Item Number: 0154-B-55 (online)
CGP system Number: 001151695
2021-06-02
Publication content below:
Race and Ethnicity Classification Consistency Between the Census Bureau
and the National Center for Health Statistics
February 1997
Working Paper Number POP-WP017
Larry Sink

ABSTRACT

The method of demographic analysis is
applied to individual birth and death records obtained from the National
Center for Health Statistics (NCHS) to produce a series of estimates of the
population of age 0 at the time of the 1990 Census. These estimates differ in
the way that race or Hispanic origin is assigned, and they are compared to
the corresponding 1990 Census figures to determine the degree of consistency
between the race and ethnicity classifications used by the two agencies and
the effect on this consistency of changing the rules by which race and
Hispanic origin are assigned. The principal findings are that assigning
births the race and Hispanic origin of the mother produces the greatest
consistency with Census results and that under this rule the agreement
between Census and NCHS on Hispanic origin is good and the agreement on race
is good except for a problem with American Indians.

Disclaimer

The views expressed are attributable to the
author and do not necessarily reflect the views of the U.S. Bureau of the
Census. If you have any questions concerning this
report, please e-mail a message to [email protected]. Include the name of this
report and author in the body of the message.

Introduction

The National Center for Health Statistics
(NCHS) is a prominent supplier of vital statistics and the Bureau of the
Census is a prominent supplier of population statistics. It is common
practice to construct vital rates using a numerator obtained from NCHS and a
denominator obtained from the Census Bureau. Since both sources offer these
statistics broken down by racial and ethnic categories that appear to be
consistent with one another, it is also common practice to use this same
procedure to construct vital statistics broken down into these same racial
and ethnic categories. However, the Census Bureau relies on
self-identification in assigning racial and ethnic categories whereas in NCHS
data these categories may be assigned by an observer; and it is not clear
that these two approaches would necessarily produce consistent results. This
paper uses the method of demographic analysis to construct estimates of the
population less than one year of age at the time of the 1990 census from NCHS
vital statistics, and compares these estimates to the corresponding estimate
from the 1990 census to determine the degree of comparability between the
racial and ethnic categories used by the two agencies.

Note that the results presented here pertain only to the race and ethnicity
classification systems currently in use by Census and NCHS; this is important
because proposed changes to Office of Management and Budget (OMB) Directive
15 could require both agencies to change their classification systems.

Demographic analysis is a method of
population estimation that does not rely on a census or surveys, but rather
on birth and death records and estimates of migration (see ref. 1 for a more
complete discussion of this topic). Demographic analysis population estimates
are constructed by age group for a specified geographic area, usually a
nation. Constructing a demographic analysis estimate of, for example, persons
50 years of age in the United States would involve tabulating all births
recorded in the United States occurring at least 50 but less than 51 years
ago, subtracting all deaths recorded to this cohort and all moves abroad by
members of this cohort, and adding all moves from abroad into this cohort.
Other data may be used to supplement the vital statistics where they prove to
be inadequate; for example, the Census Bureau uses administrative data on
Medicare enrollment in its estimates of the population 65 and older.
Demographic analysis has been used by the Census Bureau to assess the
completeness of coverage of the census following every census since 1960.
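
The cohort accounting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the field names, and the figures are hypothetical and are not taken from Census Bureau procedures or data.

```python
from dataclasses import dataclass


@dataclass
class CohortComponents:
    """Components of change for one birth cohort (hypothetical counts)."""
    births: int             # births recorded to the cohort
    deaths: int             # deaths recorded to members of the cohort
    moves_abroad: int       # cohort members who moved abroad
    moves_from_abroad: int  # persons who moved into the cohort from abroad


def demographic_analysis_estimate(c: CohortComponents) -> int:
    """Births minus deaths minus moves abroad plus moves from abroad."""
    return c.births - c.deaths - c.moves_abroad + c.moves_from_abroad


# Made-up figures, for illustration only.
example = CohortComponents(births=1000, deaths=10, moves_abroad=5, moves_from_abroad=20)
print(demographic_analysis_estimate(example))  # 1005
```

The sketch omits any supplementary data sources, such as the Medicare enrollment data mentioned above.
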
Because a demographic analysis estimate of the full population requires the
use of data covering a long period of time, it is limited in the amount of
racial and ethnic detail that can be included owing to the difficulty of
finding consistent classification schemes that have been in place over the
whole period in question. By focusing on the population of age 0 at the time
of the 1990 census, we may construct a demographic analysis estimate entirely
from recent birth and death data which contain race and ethnic detail
comparable to that found in the 1990 census.

In keeping with OMB Directive 15, all federal
agencies have been moving toward a four-race classification system (White,
Black, American Indian and Alaska Native, and Asian and Pacific Islander) and
an ethnicity classification system that would permit classification of all
individuals as either Hispanic or non-Hispanic. The OMB system is designed to
promote standardization in federal record-keeping and reporting, and does not
permit the use of categories which cannot be aggregated into the specified
categories. What is of particular importance for this paper is that the OMB
system does not permit the use of categories such as "other" or
"unknown". Both the Census Bureau and NCHS deal with missing or
otherwise unusable race data by reassigning the value to one randomly selected
from observations with valid values (see refs. 1 and 5 for descriptions of the
selection processes). However, the NCHS data used here contains the original
values, which permits the comparison of a variety of approaches to the
missing value problem in order to see how the compatibility of the two data
sources may be maximized.

Since 1989, NCHS has had a new birth
registration system in effect, which includes detailed racial and ethnic
information about both parents. The Census Bureau has received individual
record data on all births and deaths recorded in the US since 1989 from NCHS,
which includes detailed information on race and ethnicity. The birth data
contains race and ethnicity information for both parents, and NCHS has used
the parents' race information to impute the race of the child (see ref. 2 for
a description of the imputation procedure). I have used this data to construct
demographic analysis estimates of the population less than one year of age at
the time of the 1990 Census by race and by Hispanic origin. This paper compares
those estimates to the corresponding Census counts.

The Census Bureau data used here is the
Modified Age Race Sex (MARS) file, which contains data from the 1990 census
modified to correct age and race misreporting. The modification methodology
is described in Census Report CPH-L-74. There are two aspects of this
modification which are important to note here. The 1990 census results
included about 10 million persons for whom the race response was not one of
the categories listed on the census form, and who thus had to be recoded into
one of the four OMB categories in accordance with Directive 15. The intent of
the age question on the census was to obtain age in completed years on April
1, 1990, but many respondents gave age at the time they answered the
question, and tended to round to the nearest year. This produced a
substantial undercount of the population of age 0 on April 1, 1990, which had
to be modified in the MARS file.

Table 1 presents a list of the race
categories present in the NCHS data and shows how they correspond to the
four-race system just mentioned. Table 2 presents the corresponding
comparison for ethnicity. Because the intent of this paper is to examine the
underlying consistency of the two data sources, a variety of methods of recoding
the missing values will be presented to see how the confounding effect of
differing recoding schemes may be minimized. Separate comparisons of data
from the mother and father, and, in the case of race, for the child, will be
presented to determine which gives the closest agreement with the Census
Bureau data.

The demographic analysis estimates used in
this paper were constructed by tabulating the births occurring between April
1, 1989 and March 31, 1990 and subtracting the deaths to that cohort occurring
in the same period, by race and ethnicity. Estimates of net international
migration to this cohort were prepared using the methodology described in the
technical documentation accompanying PE-29 (ref. 6). Since the goal here is
to compare NCHS and Census Bureau data, these estimates were subtracted from
the corresponding MARS figures to produce an estimate of the native-born
population comparable to the demographic analysis estimate constructed from
NCHS data. No attempt was made to allow for domestic migration, owing to the
lack of a reliable method for estimating the state-to-state migration by race
and ethnicity for the age 0 population. This omission will not affect the
national-level estimates, and its effect is likely to be small for most
states over a one-year period. Consequently, the detailed tables present
results at the state as well as the national level to illustrate how the
effects differ from state to state, though it should be kept in mind that the
state-level results are merely illustrative.

Table 3 presents NCHS- and Census-based
population estimates and the percentage by which the Census-based estimate is
less than the NCHS-based one (a negative sign indicates that the Census-based
estimate is actually larger than the NCHS-based estimate). Demographic
analysis of the 1990 census revealed a net undercount of slightly less than
2% for the total population. Since under-registration of births is known to
be about 1/2%, we would expect the Census total in Table 3 to be about 1 1/2%
below that of the NCHS total if the net undercount for age 0 is the same as
for the population as a whole. The fact that the Census total is actually
between 2% and 3% below NCHS is probably the result of the particular problem the
1990 census encountered with the age 0 population that was discussed
previously, and still constitutes reasonable conformity with our
expectations. It is interesting to note that there are 18 states for which
the census-based estimate is higher than the demographic analysis estimate.
This may be the result of the failure to account for inter-state migration or
reflect a net overcount in the Census for these states, but it may also be at
least partially attributable to the mis-assignment of state of residence by
NCHS in cases where women give birth outside of their state of residence. The
large discrepancy between the demographic analysis and census-based estimates
for the District of Columbia is probably only partially attributable to
census undercount in DC, and is likely to be in large part the result of geo-coding
errors by NCHS, since it is known that a substantial number of Virginia and
Maryland residents give birth in DC hospitals.

Because demographic analysis is generally
considered to be the preferable method of population measurement, it is
considered to be the standard by which the census is judged when the two are
compared with respect to the level of population. With regard to race and
ethnicity distributions, however, the following analysis will show that the
error associated with undercount is relatively small when compared to the
differences which can be introduced by differing classification systems.
Because of the difficulty and the empirical impact of issues related to
racial and ethnic classification, the focus of this paper cannot be on which
source is correct, but rather it must be the degree to which NCHS and the
Census Bureau are consistent with one another. During the research that went
into this paper it became clear that the issues which most affect this
consistency are mixed parentage and treatment of missing values.
Consequently, the following analysis compares the race and ethnicity
distributions in the MARS data to those in demographic analysis populations
constructed under a variety of assumptions regarding mixed parentage and treatment
of missing values.

To measure the degree of comparability
between the two sets of estimates, I use the mean absolute percent error
(MAPE), which in this circumstance involves computing 100*abs(MARS-DA)/MARS
for a given racial or ethnic group [where abs() refers to the absolute value
function, MARS denotes the proportion of the MARS population in the group in
question, and DA denotes the corresponding proportion from the demographic
analysis population], and taking the mean of this quantity over all the
groups in the comparison. This calculation effectively treats the MARS
distribution as the standard in this comparison, because the demographic
analysis distributions are being altered experimentally to investigate the
circumstances under which they will and will not resemble the MARS
distributions.
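
As a minimal sketch of the MAPE calculation described above, the following Python function converts two sets of counts to proportions and averages 100*abs(MARS-DA)/MARS over the groups. The function name, the group labels, and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def mape(mars_counts, da_counts):
    """Mean absolute percent error, treating the MARS distribution as the standard.

    Both arguments map group name -> count. Counts are converted to
    proportions, and 100 * abs(MARS - DA) / MARS is averaged over groups.
    """
    mars_total = sum(mars_counts.values())
    da_total = sum(da_counts.values())
    errors = []
    for group, mars_count in mars_counts.items():
        mars_share = mars_count / mars_total
        da_share = da_counts[group] / da_total
        errors.append(100 * abs(mars_share - da_share) / mars_share)
    return sum(errors) / len(errors)


# Made-up two-group example: shares of 0.80/0.20 (MARS) vs. 0.75/0.25 (DA).
mars = {"Group A": 80, "Group B": 20}
da = {"Group A": 75, "Group B": 25}
print(round(mape(mars, da), 3))  # approximately 15.625
```

In this made-up example the same five-point gap in shares yields an error of 6.25% for the larger group and 25% for the smaller one, which shows how a small group can dominate the mean.
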
Tables 4-21 pertain to the comparability of
NCHS and Census Bureau race classifications, and these results are summarized
in Summary Table A. These tables compare the race distributions found in the
NCHS and Census Bureau data just discussed, both at the national and state
levels, and present MAPEs for these comparisons. The same admonitions
mentioned earlier regarding the state-level estimates should be applied here.
Additionally, the MAPE tends to be unreliable if one of the groups in the
comparison contains a small number of observations. Nonetheless, the
state-level MAPEs serve to show where the discrepancies between the two
sources are the most pronounced, and indicate the states where particular problems
arise from the application of certain schemes for dealing with unknown
values.

In Tables 4-6, the missing values are deleted
before computing the proportions, which has the same effect as recoding them to the four race groups in the same proportions
as the non-missing values. In Tables 7-9, the missing values
are recoded to White. In Tables 10-12, 13-15, and 16-18, they are recoded to Black, American Indian, and
Asian, respectively. In Tables 19-21 the missing values
are recoded to the most recently observed valid value in the data, which is
basically the same method used by NCHS and should produce results comparable
to those that would be obtained by using the data NCHS releases to the
public. Tables 4, 7, 10, 13, 16, and 19 use NCHS race of child, Tables 5, 8,
11, 14, 17, and 20 use race of mother, while Tables
6, 9, 12, 15, 18, and 21 use race of father. It is immediately clear that
race of father performs poorly. Race of father has more unknowns than either
race of mother or race of child, and this is probably the reason for its poor
performance. Race of mother with unknowns recoded to American Indian produced
the lowest national-level MAPE of any of these approaches. Race of mother
substantially underestimates the American Indian population regardless of
what is done with the unknowns, but consistently provides the best estimates
of the other three race groups.

Summary Table A. National-level MAPEs from NCHS vs. Census Race Comparison
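
The three ways of handling missing values compared in Tables 4-21 (deleting them, recoding them all to one race, and assigning the most recently observed valid value) can be sketched as follows. The function names, the use of None as the missing marker, and the sample records are hypothetical. In particular, the text does not say what happens to a missing value that precedes any valid value, so the `default` argument below is an assumption.

```python
MISSING = None  # hypothetical marker for an unknown or unusable race code


def drop_missing(values):
    """Delete missing values before computing proportions."""
    return [v for v in values if v is not MISSING]


def recode_to(values, category):
    """Recode every missing value to one fixed category."""
    return [category if v is MISSING else v for v in values]


def recode_to_last_valid(values, default):
    """Recode each missing value to the most recently seen valid value.

    `default` is used for missing values that occur before any valid
    value; this fallback is an assumption, not part of the source text.
    """
    result, last = [], default
    for v in values:
        if v is not MISSING:
            last = v
        result.append(last)
    return result


def proportions(values):
    """Share of each category among the given (non-missing) values."""
    total = len(values)
    return {c: values.count(c) / total for c in sorted(set(values))}


# Made-up records, for illustration only.
records = ["White", None, "Black", None, "White"]
print(drop_missing(records))
print(recode_to(records, "American Indian"))
print(recode_to_last_valid(records, default="White"))
print(proportions(drop_missing(records)))
```

Any of the recoded lists can be passed to `proportions` to obtain the race distribution that is then compared with the MARS distribution.
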
To further investigate this problem with American
Indians, the Table 14 calculations were rerun with one change:
race of father was used when the father was American Indian. The results are
presented in Table 22, which shows that this procedure reduced the average
difference from 4.1% to 2.8%. While the comparisons reported in Tables 4-22
apply the same procedure to all missing values, the results presented in
Table 23 are based on a procedure which recodes missing values on a
state-by-state basis. In this procedure, each state's missing values are
recoded to the race whose proportion in the demographic analysis population
is furthest below the corresponding proportion in the MARS data. This
approach was developed to determine if it would improve upon the results
reported in Table 22, and it does, lowering the national-level MAPE to 1.9%.
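
The state-by-state rule can be sketched roughly as below. The function names and the counts are hypothetical, and two details are assumptions because the text does not settle them: the shortfall is measured as a simple difference in shares (MARS share minus demographic analysis share), and the demographic analysis shares are computed from the non-missing records only.

```python
def recode_target(mars_counts, da_counts):
    """Pick the race whose DA share falls furthest below its MARS share.

    The shortfall is the simple difference MARS share - DA share; using a
    simple rather than a relative difference is an assumption.
    """
    mars_total = sum(mars_counts.values())
    da_total = sum(da_counts.values())
    shortfall = {
        race: mars_counts[race] / mars_total - da_counts[race] / da_total
        for race in mars_counts
    }
    return max(shortfall, key=shortfall.get)


def recode_state(mars_counts, da_counts, n_missing):
    """Add one state's missing values to the target race; return (target, counts)."""
    target = recode_target(mars_counts, da_counts)
    recoded = dict(da_counts)  # copy so the input counts are left unchanged
    recoded[target] += n_missing
    return target, recoded


# Made-up counts for a single state, for illustration only.
mars = {"White": 700, "Black": 200, "American Indian": 50, "Asian": 50}
da = {"White": 690, "Black": 200, "American Indian": 30, "Asian": 50}
target, recoded = recode_state(mars, da, n_missing=30)
print(target)   # American Indian
print(recoded)
```

Applying `recode_state` separately to each state's counts corresponds to the procedure described above, under the stated assumptions.
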
Table 24 displays the number of missing values by state (for race of mother)
and the race to which they were recoded under this scheme. Though this
approach produces the best overall result, it does particularly badly in the
District of Columbia. It is interesting to note that the approach which works
the best in the District of Columbia is to use race of father and recode
missing values to Black, which is one of the worst approaches for the nation
as a whole.

Since NCHS uses race of mother in the
tabulations it presents in its official reports and randomly recodes missing values as previously discussed, the results presented
in Table 20 give a good idea of the level of agreement between the Census
Bureau's racial categories and those used in NCHS's official publications.
The overall MAPE of 5.6% indicates reasonably good agreement, but closer
inspection reveals that almost all the discrepancy comes from the estimates
of the American Indian population, where the national-level NCHS-based
estimate is some 20% below that obtained from Census Bureau data.
Consequently, it would be more accurate to say that Census Bureau and NCHS
racial classifications show very close agreement except with regard to
American Indians, where a problem clearly exists. This problem probably has
to do with the fact that significant privileges and advantages can result
from membership in certain American Indian tribes, thus giving parents an
incentive to designate their children as American Indian if either parent is
American Indian.

Tables 25-34 deal with the comparability of
the NCHS and Census Bureau Hispanic/non-Hispanic categories, and these
results are summarized in Summary Table B. The comparisons are analogous to
those just presented for race. They present all possible combinations of the
choices between mother's and father's ethnicity and the choices among
omitting unknowns, recoding them to non-Hispanic or Hispanic, and assigning
them the last valid value in the data. Additionally, Tables 33 and 34 treat
unknowns by recoding them to the value of the other parent. At the national
level the best choice appears to be using the mother's ethnicity and omitting
unknowns, which produces a disagreement between the NCHS and Census Bureau classifications
of less than 1%, though very similar results were obtained using mother's
ethnicity with unknowns recoded to father's ethnicity. As can be seen from
the tables, however, there is considerable variation in the performance of
the various approaches at the state level. Thus, while mother's ethnicity
with omitted unknowns fares best at the national
level, at the state level this approach produces the best results only for
California, with the other states obtaining better results from one of the other
approaches. Using ethnicity of father with unknowns recoded to Hispanic is
consistently the worst approach. Summary Table B National-level MAPEs from NCHS vs. Census
Ethnicity Comparison
In its published data relating to Hispanic origin,
NCHS uses ethnicity of mother and recodes missings
to non-Hispanic. Consequently, the results presented in Table 29 should show
the level of agreement between the Census Bureau's Hispanic/non-Hispanic
categories and those used in NCHS's published reports. As can be seen, the
agreement is quite good, with a national-level MAPE of only 1.3% and no
serious problems in any of the individual states.

The overall conclusion to be drawn from
this work is that the racial and ethnic classification systems used by the
Census Bureau and NCHS show a high degree of agreement with each other,
except for a problem with American Indians. A secondary conclusion is that
ascribing the racial and ethnic characteristics of the mother to the child
seems to afford consistency with Census Bureau classifications except when
the father is an American Indian. The cause of the American Indian problem
seems to be twofold. First, because they are such a small proportion of the
population, American Indians are more likely than other racial groups to
marry a person of another race. In the NCHS data used in this analysis, there
are more births where one parent was American Indian and one was not than
there are where both parents were American Indian. Second, in mixed race
couples where one partner is American Indian, there is an incentive to
identify the children as American Indian because of the advantages that can
result from tribal membership. Consequently, any racial classification scheme
that automatically assigns children the race of one parent or the other would
report substantially fewer American Indian children than would be reported in
a system permitting self-identification. Further, this tendency of children
of mixed-race marriages to identify racially with their American Indian
parent means that there are self-identified American Indians who are
biologically only a small portion American Indian and who would thus be
unlikely to be identified as American Indian by an observer. This suggests
that this inconsistency with respect to American Indians is likely to be
found in mortality data as well as fertility data. As a result, those who are
interested in the fertility and mortality of American Indians need to take
great care in combining Census and NCHS data and in using data that draws
from both sources (e.g., NCHS's published vital rates). Those who are not
concerned with this problem or with race-ethnicity cross-classification may
regard the current NCHS and Census Bureau race and Hispanic origin
classification systems as completely compatible. It is important to keep in
mind that these results only pertain to the present systems. If these systems
were to be changed, which currently seems likely, this analysis would have to
be repeated using the new systems. It should be stressed that consistency
between the two agencies with respect to race classification and Hispanic
origin classification does not imply that they are consistent with respect to
race/Hispanic origin cross-classification, and research done at the Census
Bureau indicates that they are not. My next work in this area will be to
extend the analysis of this paper to the issue of the apparent inconsistency
in race-ethnicity cross-classification between the Census Bureau and NCHS.
Research is already in progress at the Census Bureau on a reliable method of
estimating U.S. internal migration of the age 0 population with race and
ethnic detail. Such a method would allow us to extend the analysis of this
paper to smaller levels of geography, which could be very helpful, given that
the illustrative state-level results presented here indicate that
classification inconsistency and the reasons behind it may vary considerably
from region to region. The issue which is likely to be of the greatest
interest, however, is which classification system yields the best estimates.
This issue is fraught with difficulties, not the least of which is defining
what is meant by "best estimate", and the work presented here is
only a small step towards dealing with it. In addition to further work along
the lines of inquiry begun here, more work is also needed on how we define
race on a scientific level, on how we identify ourselves racially on a
personal level, and on the nature and extent of the differences between these
two definitions.