From CUNY Academic Commons

Contents

IPUMS data, like other data used at the Center, is stored on the CUR database server, currently named Neptune (neptune.gc.cuny.edu). Other documents describe how to connect to Neptune. This document merely describes how the data is organized. An IPUMS data dictionary and usage notes are available at the IPUMS website. We do not reproduce that documentation here, but address the specific way CUR is storing the IPUMS data. So far the database contains the sample designated 1 for census years 1980, 1990, and 2000 and ACS years 2001-2007.

We follow a naming convention in which database schemas, tables, and columns are named using lowercase letters only with underscores separating words to improve readability. On the CUR database server, each data source (IPUMS, QCEW, HMDA, etc.) will get its own schema. Schemas can be thought of as a grouping of logically related tables. All IPUMS data are stored in the schema ipums. Household data is stored in the table household. Tables are referred to by their schema name, a dot separator, and their table name. In a query, the household table would be referred to as ipums.household.
In all cases official IPUMS variable names are retained. Thus, the IPUMS variable RENT appears as the column rent in the household tables. All IPUMS variables appear as columns in the household or person tables, whether or not the variable contains data for the particular year and sample. This was done so that, if we choose to add additional years and samples to the database, we will not have to alter the table structure to accommodate new variables. Thus, some columns like piplocpr (location of piped water) appear even though they only apply to a single year or sample that we don’t even (yet) have in the database (in this case the 1970 Puerto Rican census). You can (and should) create extracts only using the variables of interest.

Tables

IPUMS allows the data to be downloaded in a so-called hierarchical format. This corresponds to what in database lingo would be called normalized data. Household and person records are separated in order to avoid repeating all the household data for each person record. Household records can be uniquely identified by the columns year, datanum, and serial. Person records can be uniquely identified by the columns year, datanum, serial, and pernum. pernum begins at 1 and is numbered sequentially for all persons in the household. When querying the database, person records can be joined to their associated household using the columns year, datanum, and serial.
In order to provide faster query execution, data in the household and person tables is partitioned by year and sample. The parent tables are named, simply household and person. The naming convention for the partitions (or child tables) is household[_year][_datanum], except that datanum is dropped for sample 1 in each year. Thus, data from 1980 sample 1 is found in table household_1980 and person_1980. Data from 1980 sample 2 (not yet in the database) would be found in household_1980_2 and person_1980_2. If you want to extract data from multiple years, you can just query the parent table, e.g. person. If you only want data from a single year, such as 1990, you can query person_1990. You can also query person with the criteria WHERE year = 1990, and you will get the same result. Because of partitioning, the two queries should run at the same speed, because the query optimizer will see the WHERE year = 1990 criteria and will know that it should only look in partition person_1990. [NOTE: If we never or rarely select records from multiple samples in the same extract, it will not be useful to have samples other than sample 1 as partitions of the household and person tables. We can discuss our needs with respect to this and this documentation will be updated accordingly.]

Variables

Many variables in the IPUMS data set are coded. In the simplest case, the code is completely nominal. In this case, you may want to write a query which returns the translation of the code. The IPUMS schema follows the same naming convention for lookup tables which we apply to other datasets on the CUR database server. Lookup tables all share the same name as the coded variable, and lookup tables all contain two columns: one has the same name as the coded variable, and the second column is named descr and has the text description of the code’s meaning. Thus, to decode the variable race in the person table, we know that there is a table named race with two columns, race and descr. The purpose of this naming convention is to avoid having the user have to consult a data dictionary to find names of lookup tables and lookup table columns. Additionally, having the join fields have the same name in the data table and the lookup table will facilitate table joins. NOTE: Not all lookup tables have been created yet. If you attempt to decode a coded variable and find out that the expected lookup table is not there, let the database administrator know so that the table can be created.
In some cases IPUMS variables represent interval data—i.e., actual numeric values, such as contract rent—except for specific values which have a special meaning. Often, for example, a value of 0 or a value of all 9s (for example 9999 in a 4-digit field) means N/A. Sometimes 0 means N/A and 1 means 0, while all other values mean what they say. The reason for doing this rather than setting the N/A field to NULL is to differentiate between a variable being N/A for that record only and a variable which is not measured for that year and sample, in which case all records in that year and sample are NULL. It is important that the researcher recognize this and not, for example, count the 0 records when calculating an average. We have decided not to change this in the canonical tables. However, we can create views (basically a saved and optimized SELECT statement) which recode the variables, substituting NULL for 0 and 0 for 1 as appropriate.