I have done research using longitudinal data on establishment workforce composition. These data come from the Equal Employment Opportunity Commission (EEOC). Since 1966, to monitor compliance with the Civil Rights Act, the EEOC has required any private-sector establishment with more than 100 employees (50, if they do significant federal contract work) to file an EEO-1 establishment survey. These forms have some information about the firm itself, such as its location and industry. The key data are a matrix of occupations and race/sex tuples. Employers enter counts of employees in each relevant cell. Thus an EEO-1 form will show you how many Hispanic women are employed as Technicians, for example. (See a sample form here.) The EEOC does not separately verify these reports but does reserve the right to audit them. Prior studies have concluded that the EEO-1 surveys are among the best data we have on workforce composition over time.
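To make the shape of the form concrete, here is a minimal Python sketch of one establishment-year as a mapping from (occupation, race, sex) cells to headcounts. The category labels and counts are made up for illustration; they are not the EEOC's actual codes or anyone's real data.

```python
from collections import Counter

# Hypothetical counts for a single establishment in a single year.
# Keys are (occupation, race, sex) cells; values are employee counts.
cells = {
    ("Technicians", "Hispanic", "Female"): 4,
    ("Technicians", "White", "Male"): 10,
    ("Professionals", "Black", "Female"): 3,
    ("Professionals", "Hispanic", "Female"): 2,
}


def marginal(cells, axis):
    """Sum counts along one axis: 0 = occupation, 1 = race, 2 = sex."""
    totals = Counter()
    for key, count in cells.items():
        totals[key[axis]] += count
    return dict(totals)


# How many Hispanic women are employed as Technicians?
hispanic_women_technicians = cells[("Technicians", "Hispanic", "Female")]

# Headcount by occupation, collapsing over race and sex.
by_occupation = marginal(cells, 0)
```

Most composition measures start from marginals like `by_occupation`, computed per establishment and year.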
The EEOC thinks that it is important for researchers to work with their data, in part to see whether and how they are achieving their mission. It’s a rare and refreshing stance for a government agency to have! Access comes through the Intergovernmental Personnel Act. Thus I have at times been an employee on secondment to the EEOC, at a salary of $0 per year. This gives me access to the EEOC’s data but also puts me under the same privacy and data-protection requirements as any of the agency’s employees.
tl;dr: Researchers cannot just share EEOC data with one another. This is good and bad. Obviously it would be great if these data could be shared with no restrictions. Given the state of the law, though, this isn’t going to happen. And if you tried to get EEOC data through a channel like the Freedom of Information Act, whatever the agency could release would be heavily redacted. That would make linking EEOC data to other data sources virtually impossible. Overall, then, the agency’s access program is a net positive.
Nonetheless, this process slows down research with these data. In particular, initial conversion of the EEOC data can be a bear. The EEOC complies with data requests by sending researchers SAS7BDAT-format files. Never heard of those, you say? You are not alone! This is the “SAS 7 Binary Data” format, which is kind of old-school even if you do use SAS, never mind if you work in NumPy, R, or Stata.
In principle, you can convert SAS files to other formats using software like Stat/Transfer. In practice, several fields in these files are compressed, using a janky, proprietary, obsolete SAS compression algorithm. Stat/Transfer throws errors when converting these fields, and the number of errors grows quickly on the older files. There is also no reason to think that these errors are random, so you cannot safely ignore the affected records.
My solution, the last time I got raw data from the agency, was to write some SAS code that iterates over the files and writes a CSV for every year. (Many thanks to Simona Abis, lately of INSEAD, who basically walked me through this on very short notice!) Then I have a Stata do-file that cleans up these CSV files. The end result is a longitudinal dataset covering 1971-2014, save for the years therein where the EEOC does not have digital data (1974, 1976, and 1977).
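For readers who do not use Stata, the stacking step can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the code in my repository: the file naming pattern (`eeo1_{year}.csv`), the function names, and the idea that column layouts may differ by year are all hypothetical. The only details taken from above are the 1971-2014 range and the three missing years.

```python
import csv
from pathlib import Path

# Years for which the EEOC has no digital data (per the text above).
MISSING_YEARS = {1974, 1976, 1977}


def expected_years(first=1971, last=2014):
    """Years that should have a per-year CSV."""
    return [y for y in range(first, last + 1) if y not in MISSING_YEARS]


def stack_yearly_csvs(csv_dir, out_path, pattern="eeo1_{year}.csv"):
    """Stack per-year CSVs into one long CSV with a 'year' column.

    The file name pattern is an assumption made for this sketch.
    Returns the list of years that were actually found and written.
    """
    csv_dir = Path(csv_dir)
    found = [(y, csv_dir / pattern.format(year=y)) for y in expected_years()]
    found = [(y, p) for y, p in found if p.exists()]

    # First pass: take the union of column names across years,
    # in case the layout differs from file to file.
    columns = ["year"]
    for _, path in found:
        with path.open(newline="") as f:
            for name in next(csv.reader(f), []):
                if name not in columns:
                    columns.append(name)

    # Second pass: write every row, leaving absent columns blank.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="",
                                extrasaction="ignore")
        writer.writeheader()
        for year, path in found:
            with path.open(newline="") as f:
                for row in csv.DictReader(f):
                    row["year"] = year
                    writer.writerow(row)
    return [y for y, _ in found]
```

Comparing the returned list against `expected_years()` is a quick way to spot a year whose export silently failed.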
This file could be built differently. For example, I did not want to faff with encoding county information on the oldest (1966) file, so the script does not include that cleaning. However, I think it would be vastly easier for another researcher to modify my script than to write the dang thing from scratch. Through such fits and starts does science proceed! This file is also my starting point for analyses in a couple of projects, like this article on establishment-level racial employment segregation, and for procedures like geocoding the establishments, so it is useful to preserve this version of the script.
I have posted all of this code on GitHub:
EEO-1 survey cleaning code on GitHub
Feedback or questions are always appreciated.