What are "microdata"?
The unit of observation is the individual: microdata are the responses of specific individuals to census questions. They are NOT aggregate data or summary statistics. You will not find pre-tabulated census tables here; for example, you won't find marital status by sex for some locality. Microdata offer a lot of flexibility: you can aggregate the data yourself to make tables or, more likely, conduct individual-level multivariate analysis. This is what an IPUMS data table looks like.
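As a rough illustration of aggregating microdata yourself, the sketch below builds the kind of marital-status-by-sex table that a pre-tabulated source would otherwise supply. The records, field names (`sex`, `marst`), and labels are made up for the example and are not taken from any real IPUMS extract.

```python
from collections import Counter

# Hypothetical person-level records: one dict per individual.
# Field names and values are illustrative assumptions only.
persons = [
    {"sex": "Male", "marst": "Married"},
    {"sex": "Female", "marst": "Married"},
    {"sex": "Female", "marst": "Single"},
    {"sex": "Male", "marst": "Single"},
    {"sex": "Female", "marst": "Widowed"},
    {"sex": "Male", "marst": "Married"},
]


def crosstab(records, row_key, col_key):
    """Count how many records fall in each (row, column) combination."""
    return Counter((r[row_key], r[col_key]) for r in records)


table = crosstab(persons, "marst", "sex")

# Print one line per non-empty cell of the table.
for (marst, sex), n in sorted(table.items()):
    print(f"{marst:<8} {sex:<6} {n}")
```

The same individual records could just as easily feed a regression or any other person-level analysis; the table is only one possible aggregation.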
What are "harmonized" variables?
Harmonization allows you to compare variables across different censuses and different countries. For example, most censuses ask about marital status, but the classification scheme might differ between censuses. Some might have a general category of "married." Some might have one category for religious marriage and another for civil marriage. The numeric codes underlying the census might also differ: divorce might be coded "4" in one census and "2" in another.
Variables are recoded from each census using correspondence tables. Here is an example of a correspondence table that IPUMS uses to recode the marital status variable.
IPUMS uses a composite coding scheme so that it doesn't lose too much valuable information; otherwise, the data would be reduced to the lowest common denominator across samples. The first one or two digits of the code provide information that is available across all samples. The next one or two digits provide detail that is available in a broad subset of samples. Trailing digits carry detail that is only rarely available. In the marital status example, the leading digit "2" represents "Married/In Union," while "211" indicates "Civil" marriage. Even though the source censuses' underlying numeric codes vary across years, they all end up the same in IPUMS.
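The recoding and composite-code ideas can be sketched roughly as follows. This is not the actual IPUMS correspondence table: the sample names, the source codes, and the harmonized codes `200`, `212`, and `330` are assumptions made for illustration. Only the meanings of the leading digit `2` and of `211` come from the text above.

```python
# Hypothetical correspondence table: source code -> harmonized detailed code.
# Sample names and most codes are illustrative assumptions.
CORRESPONDENCE = {
    "census_a": {
        1: 200,  # "married" with no further detail (assumed zero-padded code)
        4: 330,  # divorced is "4" in this census (330 is an assumed code)
    },
    "census_b": {
        2: 330,  # divorced is "2" in this census
        3: 211,  # civil marriage
        5: 212,  # religious marriage (assumed code)
    },
}


def harmonize(sample, source_code):
    """Look up the harmonized detailed code for one source code."""
    return CORRESPONDENCE[sample][source_code]


def general_code(detailed_code):
    """Keep only the leading digit of a 3-digit detailed code."""
    return detailed_code // 100


# Different source codes for "divorced" end up with the same harmonized code.
print(harmonize("census_a", 4) == harmonize("census_b", 2))

# Coarse and fine married categories share the same general code.
print(general_code(harmonize("census_a", 1)), general_code(harmonize("census_b", 3)))
```

Because the detail lives in the trailing digits, a census that records only "married" and a census that distinguishes civil from religious marriage remain comparable at the general level.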
The other benefit of harmonization is good variable documentation, which covers details that would otherwise not be self-evident. The documentation emphasizes comparability issues for international comparisons and includes country-specific discussions. IPUMS links to both English-language and original-language questionnaires and instructions.
What are "source" variables?
Source variables are unique to each census sample. These are un-harmonized, so they don't necessarily have the same codes and labels across countries and years.
What are "pointer" variables?
Pointer variables identify family interrelationships within a household, so you can construct individual-level variables about co-resident persons, such as occupation of spouse or age of mother.
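A minimal sketch of how a pointer can be followed to build such a variable is shown below. The field names `pernum` (person number within the household) and `momloc` (person number of the co-resident mother, 0 if none) are modeled on IPUMS naming but should be treated as assumptions here, as should the made-up household and the output name `age_mom`.

```python
# Hypothetical household: persons 3 and 4 point to person 2 as their mother.
household = [
    {"pernum": 1, "age": 40, "momloc": 0},
    {"pernum": 2, "age": 38, "momloc": 0},
    {"pernum": 3, "age": 12, "momloc": 2},
    {"pernum": 4, "age": 9, "momloc": 2},
]


def attach_mother_age(members):
    """Return copies of the records with the co-resident mother's age added."""
    by_pernum = {m["pernum"]: m for m in members}
    result = []
    for m in members:
        # A pointer of 0 matches no person, so the lookup returns None.
        mother = by_pernum.get(m["momloc"])
        result.append({**m, "age_mom": mother["age"] if mother else None})
    return result


for person in attach_mother_age(household):
    print(person["pernum"], person["age"], person["age_mom"])
```

The same lookup pattern would apply to a spouse or father pointer; with a real extract the lookup would also need to be restricted to persons in the same household.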
What are "general" and "detailed" versions of variables?
Some variables have general and detailed versions. For example, you can get a 1-digit general version of "employment status" or use a fully detailed 3-digit version.
What are "weights"?
Most IPUMS samples are unweighted or "flat": every person in the sample data represents a fixed number of persons in the population. For example, let's say every person in the sample represents 10 people. One-quarter of IPUMS samples are weighted, with some records representing more cases than others. For example, let's say one person represents 5 people, while another person in the sample represents 20 people. You need to apply the person weight (PERWT) or household weight (HHWT) variable in order to obtain representative statistics for the entire population. When running regressions, applying the weights may leave point estimates largely unchanged, but it can affect standard errors.
If you are helping people use IPUMS, make sure to tell them they need to pay attention to these weights: they need to extract the weight variables (this is the default for all extracts) and use them properly.
This is a really helpful blog post from IPUMS explaining sample weights. (An Island has 1000 birds: Hummingbirds and Pelicans...)
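To see why the weights matter, here is a small sketch comparing an unweighted and a weighted share. The four records and the `employed` field are invented for the example; `perwt` stands in for the PERWT person weight and reuses the 5 and 20 from the example above.

```python
# Hypothetical weighted sample: two records stand for 5 people each,
# two records stand for 20 people each.
sample = [
    {"employed": True, "perwt": 5},
    {"employed": True, "perwt": 5},
    {"employed": False, "perwt": 20},
    {"employed": False, "perwt": 20},
]


def unweighted_share(records, key):
    """Share of sample records where `key` is true, ignoring weights."""
    return sum(1 for r in records if r[key]) / len(records)


def weighted_share(records, key, weight="perwt"):
    """Share of the represented population where `key` is true."""
    total = sum(r[weight] for r in records)
    return sum(r[weight] for r in records if r[key]) / total


# Sum of weights: the number of people the sample represents.
population = sum(r["perwt"] for r in sample)

print(population)
print(unweighted_share(sample, "employed"))
print(weighted_share(sample, "employed"))
```

Half of the sample records are employed, but those records represent only 10 of the 50 people in the population, so the two shares differ noticeably. In a flat sample, where every weight is the same, the two calculations give the same answer.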
What is a "universe"?
A universe is the population at risk of having a response to the variable in question. For example, if children are not usually employed in a specific country, they are excluded from the employment questions, and the universe for the employment variable might be persons aged 16 and older.
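The practical consequence is that persons outside the universe should be left out of the denominator. The sketch below assumes a code of 0 for "not in universe" and 1 for "employed"; these codes, the field names, and the records are illustrative assumptions, so check the variable's documentation for the real codes in a given sample.

```python
NIU = 0       # assumed code for "not in universe"
EMPLOYED = 1  # assumed code for "employed"

# Hypothetical records: the two children were never asked the question.
persons = [
    {"age": 8, "empstat": NIU},
    {"age": 14, "empstat": NIU},
    {"age": 30, "empstat": EMPLOYED},
    {"age": 45, "empstat": EMPLOYED},
    {"age": 52, "empstat": 2},
    {"age": 70, "empstat": 3},
]


def employment_rate(records):
    """Share employed among persons who are in the variable's universe."""
    in_universe = [r for r in records if r["empstat"] != NIU]
    employed = sum(1 for r in in_universe if r["empstat"] == EMPLOYED)
    return employed / len(in_universe)


# Dividing by everyone, children included, understates the rate.
naive_rate = sum(1 for r in persons if r["empstat"] == EMPLOYED) / len(persons)

print(employment_rate(persons))
print(naive_rate)
```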