EEOC EEO-1 Survey code

I have done research using longitudinal data on establishment workforce composition. These data come from the Equal Employment Opportunity Commission (EEOC). Since 1966, to monitor compliance with the Civil Rights Act, the EEOC has required any private-sector establishment with more than 100 employees (50, if they do significant federal contract work) to file an EEO-1 establishment survey. These forms have some information about the firm itself, such as its location and industry. The key data are a matrix of occupations and race/sex tuples, in which employers enter counts of employees in each cell. Thus, for example, an EEO-1 form will show you how many Hispanic women are employed as Technicians (see a sample form here). The EEOC does not separately verify these reports but does reserve the right to audit them. Prior studies have concluded that the EEO-1 surveys are among the best data we have on workforce composition over time.
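
To make the layout concrete, here is a minimal Python sketch of one form's occupation-by-race/sex matrix, stored as a mapping from (occupation, race, sex) to a headcount. The category labels are a small illustrative subset and the counts are invented; neither is taken from a real filing or from the EEOC's file layout.

```python
from collections import defaultdict

# Hypothetical counts for one establishment-year; keys are (occupation, race, sex).
form = {
    ("Technicians", "Hispanic", "Female"): 4,
    ("Technicians", "Hispanic", "Male"): 6,
    ("Technicians", "White", "Female"): 10,
    ("Professionals", "Black", "Female"): 3,
    ("Professionals", "White", "Male"): 12,
}


def total_by(form, position):
    """Sum headcounts over one key position (0=occupation, 1=race, 2=sex)."""
    totals = defaultdict(int)
    for key, count in form.items():
        totals[key[position]] += count
    return dict(totals)


# A single cell of the matrix, plus two marginal totals.
hispanic_women_technicians = form[("Technicians", "Hispanic", "Female")]
by_occupation = total_by(form, 0)
by_sex = total_by(form, 2)
```

Keeping each cell under an explicit key makes it easy to aggregate along any one dimension without committing to a particular wide-format column layout.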

The EEOC thinks that it is important for researchers to work with its data. It's a rare and refreshing stance for a government agency to have! Access used to be through the Intergovernmental Personnel Act (IPA). Thus I have at times been an employee on secondment to the EEOC, at a salary of $0 per year. This gave me access to the EEOC's data but also put me under the same privacy and data-protection requirements as any of the agency's employees. Since early 2018 the IPA process has been suspended; see here for more detail.

tl;dr: Researchers cannot just share EEOC data with one another, because legal privacy restrictions limit this. If one were to request EEOC data through the Freedom of Information Act, whatever the agency could release would be heavily redacted, which would make linking EEOC data to other data sources virtually impossible. Overall, then, the agency's access program has been a net positive.

Nonetheless, these restrictions slow down research with these data. Initial conversion of the EEOC data can be a bear. The EEOC complied with data requests by sending researchers SAS7BDAT-format files. Never heard of those, you say? You are not alone! This is the "SAS 7 Binary Data" format, which is kind of old-school even if you do use SAS. Never mind NumPy, R, or Stata.

In principle, you can convert SAS files to other formats using software like Stat/Transfer. In practice, several fields in these files are compressed, using a janky, proprietary, obsolete SAS compression algorithm. Stat/Transfer throws errors when converting these, and the number of errors grows quickly on the older files. There is also no reason to think that any of these errors are random, so simply dropping the affected records risks biasing the sample.

My solution, the last time I got raw data from the agency, was to write some SAS code that iterates over the files and writes a CSV for every year. (Many thanks to Simona Abis, lately of INSEAD, who basically walked me through this on very short notice!) Then I have a Stata do-file that cleans up these CSV files. The end result is a longitudinal dataset covering 1971-2014, save for the years where the EEOC does not have digital data (1974, 1976, and 1977).
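
The cleaning itself lives in the Stata do-file. Purely as an illustration of the stacking step, here is a rough Python sketch that appends per-year CSVs into one long file and tags each row with its year. The file naming (`eeo1_1971.csv` and so on), the function name `stack_yearly_csvs`, and the handling of columns that appear in only some years are all assumptions for this sketch, not a description of what the posted scripts do.

```python
import csv
import re
from pathlib import Path

# Assumed naming convention: the four digits before ".csv" are the survey year.
YEAR_PATTERN = re.compile(r"(\d{4})\.csv$")


def stack_yearly_csvs(csv_dir, out_path):
    """Append every per-year CSV in csv_dir into one long file with a 'year' column.

    Returns the number of data rows written.
    """
    files = []
    for path in Path(csv_dir).glob("*.csv"):
        match = YEAR_PATTERN.search(path.name)
        if match:
            files.append((int(match.group(1)), path))
    files.sort()

    # First pass: collect the union of column names, since layouts may differ by year.
    columns = ["year"]
    for _, path in files:
        with open(path, newline="") as f:
            for name in next(csv.reader(f), []):
                if name not in columns:
                    columns.append(name)

    # Second pass: write rows, leaving cells blank where a year lacks a column.
    n_rows = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=columns, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for year, path in files:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    row["year"] = year
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Reading the headers in a separate first pass keeps memory use flat, which matters when each year's file is large; the trade-off is that every file is opened twice.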

This file could be built differently. For example, I did not want to faff with encoding county information on the oldest, 1966 file, and have not included such cleaning in the script. However, I think it would be vastly easier for another researcher to modify my script than to write the dang thing from scratch. Through such fits and starts does science proceed! This file is also the starting point for analyses in a couple of my projects, like this article on establishment-level racial employment segregation, and for procedures like geocoding the establishments, so it is useful to preserve this version of the script.

I have posted all of this code on GitHub:

EEO-1 survey cleaning code on GitHub

Feedback or questions are always appreciated.