NLRB Representation case data

I have done a lot of research that draws on data from union-representation elections, originally gathered by the National Labor Relations Board. A few times a year, someone asks me about working with the NLRB's data, and I explain some of the ins and outs.

In 2004, when I first started using NLRB data, they were quite hard to obtain. You had to file a Freedom of Information Act (FOIA) request, even though the Board's records of these elections are supposed to be public once the case has been closed. The situation improved greatly under the Obama Administration. In particular, the NLRB began uploading its closed records to data.gov. For the first time, you could just download the data and work with it!

...Sort of.

The NLRB botched their data upload. The resulting files have several quirks in their file encodings. The NLRB also zipped files into archives that have the wrong names, and they pointed some of their links on data.gov toward the wrong files, such that the user might think that data for certain years are missing. Worst of all, they have not provided codebooks for these files, though some of that information is available elsewhere.

My goal here is not to address all of the NLRB's shortcomings. I do not plan to rewrite codebooks for all of their files. (I have included an Excel file that has basic descriptions of each of the files in the database.) Rather, my goal is to assemble raw NLRB representation-case data in one place, and explain what is available and what is not, so that at least future researchers do not have to reinvent the wheel.

What are "representation cases"?

If you want to form a union in the United States, the main channel through which you do so is a secret-ballot election held at the workplace. In this election, the workers vote for whether they want a particular union to be certified as their representative for collective bargaining with their employer. I discuss this in greater detail in several of my published papers, particularly my "Eyes of the Needles" piece. The National Labor Relations Board oversees these elections. For each, the NLRB opens a case file. In the agency's terminology, these are called "R-cases." (This distinguishes them from "C-cases," through which the agency investigates charges of unfair labor practices.)

Unions are not in great shape in America today. This is not the place to discuss my views on organized labor. I will stress though that, for organizational scholars, the labor union is an interesting, and in some ways unique, organizational form that deserves more attention. One thing that drew me to working with these data was that unions file petitions to hold elections, and the NLRB opens its R-case files, before the election is held. This means that, conditional on making it to the stage where a petition is filed, the NLRB has records of failed organizational founding attempts. This is rare; most of the time, we only have records of organizations conditional on their establishment. This means that we can answer certain questions with NLRB data that we cannot answer elsewhere. An example: say you care about diversification. If you have something like a firm's sales data, you can see whether they successfully entered a new product market, but you might not be able to tell whether they tried and failed to enter a market--versus never trying at all. With union data, failed "diversification" attempts are visible.

Another advantage of these data is that representation elections happen at the level of the establishment, not (usually) at the level of the firm. This opens the door to various spatial analyses, such as looking at spillover between other forms of protest and union organizing in different cities (see my "Osmotic Mobilization" paper with Thomas Dudley and Sarah Soule). This also means that representation elections can be matched to other establishment-level data, such as that on workforce composition, as I did in these two papers.

Thus even if you do not care about the phenomenon of union representation elections, I think that, for social scientists, these organizational-founding attempts represent a strategic site (in the Mertonian sense) for testing theories. The institutional details of the union setting give you a window on an organizational process whose workings are normally unobservable to the researcher. One more example, to flesh out that abstract statement: unions can limit managers' ability to hire and fire employees in ways similar to what modern diversity policies do. While we cannot see or model the adoption of diversity policies by firm management, we can model the adoption of unionization through these elections. I exploit this fact to test predictions from organizational sociology about the effectiveness of such diversity policies in my "Control of Managerial Discretion" piece.

Enough of that. On to the data.

Two databases...maybe 2.5

From 1999 through 2010, the NLRB's database was called CATS, the Case Activity Tracking System. Representation and complaint cases were both recorded in it. The "R-case" side of the database comprised about two dozen tables. The chief key linking these tables was the "r_case_number" field, though for a few tables the "election_id" is the key. Data dumps for these years consist of two sets of records. One supposedly has the full database; the other contains "frequently requested fields" (FRF). I say "supposedly" because there are variables, such as the bargaining-unit type, listed in the frequently requested fields file that are not found in any of the "full" database's tables. I have no explanation for this. The implication is that the FRF file is not a strict subset of the full database: if one wants all the available data, both must be compiled.
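To make that last point concrete, here is a minimal Python sketch of compiling the two sources, assuming both have already been read into lists of dictionaries. Only "r_case_number" comes from the NLRB's tables; the other field names ("date_filed", "unit_type"), the case numbers, and the function name are hypothetical, and the sketch assumes at most one FRF row per case.

```python
def merge_full_and_frf(full_rows, frf_rows, key="r_case_number"):
    """Left-join FRF-only fields onto rows from a 'full' CATS table.

    Where both sources supply the same field, the full-table value wins.
    Assumes at most one FRF row per case number (an assumption, not a
    documented property of the NLRB files).
    """
    frf_by_case = {row[key]: row for row in frf_rows}
    merged = []
    for row in full_rows:
        extra = frf_by_case.get(row[key], {})
        # Later keys override earlier ones, so full-table values take priority.
        merged.append({**extra, **row})
    return merged


# Hypothetical records for illustration only.
full = [
    {"r_case_number": "01-RC-000001", "date_filed": "2005-01-03"},
    {"r_case_number": "01-RC-000002", "date_filed": "2005-02-10"},
]
frf = [{"r_case_number": "01-RC-000001", "unit_type": "industrial"}]

for record in merge_full_and_frf(full, frf):
    print(record)
```

Cases that appear only in the full table pass through unchanged, so nothing is dropped when the FRF file is missing a record.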

The CATS data, both the full tables and the FRF files, are what the NLRB has posted to data.gov over the years. There are several errors in their postings. The 2000 and 2003 full-file links, for example, point at the incorrect file names. Fortunately their naming convention is standard enough that a URL with the "correct" file name brings up the file for download. The FRF files, meanwhile, are almost all misnamed. The 1999 file does indeed have the 1999 data, but the 2000 file also has the 1999 data! This pattern persists, such that the 2010 file actually has the 2009 data. As best as I can tell, the 2010 FRF files are unavailable.
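The off-by-one labeling can be written down as a small lookup. This Python sketch assumes the pattern holds for every year between the two endpoints described above; the function name is invented for illustration.

```python
def frf_data_year(label_year):
    """Return the year of data actually inside the FRF file labeled `label_year`.

    Assumption: the 1999 file is labeled correctly, and every later file
    (2000 through 2010) holds the previous year's data.
    """
    if not 1999 <= label_year <= 2010:
        raise ValueError("FRF files are only labeled 1999 through 2010")
    return 1999 if label_year == 1999 else label_year - 1


for label in (1999, 2000, 2010):
    print(label, "->", frf_data_year(label))
```

Under this mapping no label returns 2010, which is consistent with the 2010 FRF data being unavailable.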

The CATS data were posted as XML files. Here again, there is a problem. The posted files are encoded in UTF-16 rather than UTF-8. This is not the place to explain what that means precisely (though I highly recommend this video for a tutorial), but--since there is no other indication that these are in 16-bit encoding--a text parser that assumes an 8-bit encoding reads these files as having a non-printing null character after every real character. This completely chokes the parser.

It took me a long time to figure out that this was what was going wrong with these files. (You can't see non-printing characters on the screen in most editors!) Fortunately the fix is relatively easy, and then all you have to do is move from XML to some more analyzable format. I chose to do that by passing through CSV to Stata's DTA format. But I know that R is ascendant, so splitting code into two parts, such that people can just build the CSV and use it from there, seems prudent.
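To show the shape of the fix, here is an illustrative Python sketch rather than the scripts I actually used. It decodes the bytes as UTF-16, parses the XML, and writes CSV text. The function names are invented, the sample record is made up, and the sketch assumes a flat layout in which each child of the root element is one record and each grandchild is one field; the real tables may be nested differently.

```python
import csv
import io
import xml.etree.ElementTree as ET


def utf16_xml_to_rows(raw):
    """Decode UTF-16 bytes, parse the XML, and return one dict per record."""
    # The "utf-16" codec uses the byte-order mark, if present, to pick endianness.
    text = raw.decode("utf-16")
    root = ET.fromstring(text)
    # Assumed layout: root -> record elements -> field elements.
    return [{field.tag: (field.text or "") for field in record} for record in root]


def rows_to_csv(rows):
    """Write the records as CSV text, using the union of field names as columns."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample standing in for one of the posted files.
sample = (
    "<table><row><r_case_number>01-RC-000001</r_case_number>"
    "<election_id>7</election_id></row></table>"
).encode("utf-16")

print(rows_to_csv(utf16_xml_to_rows(sample)))
```

For a real file, the bytes would come from opening it in binary mode, and the CSV text would be written out with a UTF-8 encoding so that downstream tools see ordinary 8-bit text.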

With this in mind, I have made a GitHub repository that has copies of the NLRB's full tables and FRF files, as well as the necessary code to transform these into CSV files and/or Stata's DTA format.

nlrb-cats repository on GitHub

Documentation about how to use those scripts is in the README.md on the site.

From 1984 to 1999, the NLRB used a different database system, called CHIPS (Case Handling Information Processing System, IIRC). The data from that system have never been uploaded, and may be lost to history. However, since the early 1960s, the NLRB had a data-sharing agreement with America's largest trade-union federation, the AFL-CIO. Every month the NLRB would send the AFL-CIO a data dump with results from various representation and (sometimes) complaint cases, and in return the AFL-CIO would answer questions about said data that people posed to the NLRB. As a result, the AFL-CIO has records of some aspects of representation cases stretching back to the mid-1960s.

During my dissertation, I got to know people in the AFL-CIO's collective-bargaining department (special thanks to Gordon Pavy, Alfonso Nevárez, and Sheldon Friedman), who shared with me some of the Federation's historical data on union-organizing drives. One result from that research was a file that has the results of representation elections from 1965 through 1998.

This file is not as complete as the CATS data. In particular, it does not contain information on whether unfair labor practices were involved in the organizing campaign. It also records only election results, and thus has no information on withdrawn election petitions. Thus these data suffer more from survivor bias than the CATS records. Nonetheless, it is a valuable example of long-term longitudinal data at the establishment level, and I think it should be more widely available.

nlrb_old_rcases repository on GitHub

This too I have uploaded to GitHub. I have included the Stata-formatted DTA file and the script that generates it. I have not included the original text file because it was exceedingly large. Because it had many 255-character string fields, most of them empty, the file was over 2.5 gigabytes of mostly empty space. Reconstructing the data dictionary for this fixed-width file was Borgesian!
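For readers who have not dealt with fixed-width files: each field occupies the same range of character positions on every line, so parsing requires a data dictionary of names, offsets, and widths. The Python sketch below shows the idea. The layout, field names, and sample line are invented for illustration and are not the real dictionary for this file.

```python
# Hypothetical layout: (field name, start offset, width). The real file's
# data dictionary is not reproduced here.
LAYOUT = [
    ("case_number", 0, 12),
    ("employer_name", 12, 255),
    ("votes_for", 267, 5),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict, trimming the padding spaces."""
    return {name: line[start:start + width].strip() for name, start, width in layout}


# Made-up record: a 255-character name field that is almost entirely padding.
sample_line = "01-RC-000001" + "Acme Widgets".ljust(255) + "42".rjust(5)

print(len(sample_line))
print(parse_fixed_width(sample_line))
```

The sample line is 272 characters long but carries only 26 characters of content, which is the same padding effect that made the original text file so large.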

Why I built this

As of this writing (summer 2016), I have worked with the NLRB's data in some form or fashion for more than twelve years. With each new project, I think to myself, "I should really clean these up right." Then I get busy with the project itself, and don't get around to it. Recently, a colleague at another school asked me if I knew where something was in the NLRB data, and while I knew the answer, I didn't have a simple link or file I could send him. Answering his request (which was a reasonable one, given the work I've done) required rebuilding some of these files virtually from scratch. At that point, I was disgusted with myself, and decided to get this documented once and for all.

So, in the end, I guess that this page is the academic equivalent of paying it forward.