I have done a lot of research that has drawn on data about union-representation elections, originally gathered by the National Labor Relations Board. A few times a year, someone will ask me about working with the NLRB’s data, and I will explain some of the ins and outs.
Back in 2004, when I first started using NLRB data, they were quite difficult to obtain. You usually had to file a Freedom of Information Act (FOIA) request, even though the Board’s records of these elections are supposed to be public once the case has been closed. The situation improved greatly under the Obama Administration. In particular, the NLRB began uploading its closed records to data.gov. For the first time, you could just download the data and work with it!
Simply put, the NLRB botched their data upload. The resulting files have several quirks having to do with their file encodings that make them impossible for many people to work with. They also zipped files into archives that have the wrong names, and they pointed some of their links on data.gov toward the wrong files, such that the user is led to think that data for certain years are missing. Worst of all, they have not provided good codebooks for these files, though some of that information is available elsewhere.
In hosting this page, my goal is not to address all of the NLRB’s shortcomings. I do not plan, for example, to rewrite codebooks for all of their files. (I have included an Excel file that has basic descriptions of each of the files in the database.) Rather, my goal is to assemble as much of the raw NLRB representation-case data as possible in one place, and to explain what is available and what is not, so that at least future researchers do not have to reinvent the wheel.
What are “representation cases?”
If you want to form a union in the United States, the main channel through which you do so is a secret-ballot election held at the workplace. In this election, the workers vote for whether they want a particular union to be certified as their representative for purposes of collective bargaining with their employer. I discuss this in greater detail in several of my published papers, particularly my “Eyes of the Needles” piece. These elections are overseen by the National Labor Relations Board. For each election, the NLRB opens a case file. In the agency’s terminology, these are called “R-cases.” (This distinguishes them from “C-cases,” through which the agency investigates charges of unfair labor practices.)
Unions are not in the best of shape in America today. This is not the place to discuss my views on organized labor. I will stress though that, for organizational scholars, the labor union is an interesting, and in some ways unique, organizational form that deserves more attention. One of the things that first drew me to working with these data was that unions file petitions to hold elections, and the NLRB opens its R-case files, before the election itself is held. This means that, conditional on making it to the stage where a petition is filed, the NLRB has records of failed organizational founding attempts. This is rare; most of the time, we only have records of organizations conditional on their establishment. This means that we can answer certain questions with NLRB data that we cannot answer elsewhere. An example: say you care about diversification. Usually, if you only have something like a firm’s sales data, you can see whether they successfully entered a new product market, but you might not be able to tell whether they tried and failed to enter a market–versus never trying at all. With union data, failed “diversification” attempts are visible.
Another advantage of these data is that representation elections happen at the level of the establishment, not (usually) at the level of the firm. This opens the door to various spatial analyses, such as looking at spillover between other forms of protest and union organizing in different cities (see my “Osmotic Mobilization” paper with Thomas Dudley and Sarah Soule). This also means that representation elections can be matched to other establishment-level data, such as that on workforce composition, as I did in these two papers.
Simply put, even if you do not care about the phenomenon of union representation elections as such, I think that, for social scientists, these organizational-founding attempts represent a strategic site (in the Mertonian sense) for testing theories. This is because the institutional details of the union setting often give you a window on an organizational process whose workings are informative for certain theoretical predictions, but that is normally not observable to the researcher. One more example, just to flesh out that abstract statement: unions can limit managers’ ability to hire and fire employees in ways similar to what modern diversity policies do. While we cannot see or model the adoption of diversity policies by firm management, we can model the adoption of unionization through these elections. I exploit this fact to test the predictions of organizational sociology about the effectiveness of such diversity policies in my “Control of Managerial Discretion” piece.
Enough of that. On to the data.
Two databases…maybe 2.5
From 1999 through 2010, the NLRB’s database was called CATS, the Case Activity Tracking System. Representation and Complaint cases were both recorded in it. The “R-case” side of the database comprised about two dozen tables. The chief key linking these tables was the “r_case_number” field, though for a few tables the “election_id” serves the same purpose. Data dumps for these years consist of two sets of records. One supposedly has the full database, the other contains “frequently requested fields.” I say “supposedly” for the first because there are variables, such as the bargaining-unit type, listed in the frequently requested fields file that are not found in any of the tables of the “full” database. I have no explanation for this. The implication is that, if one wants all the available data, the “FRF” file is not a strict subset of the full database but must also be compiled.
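To illustrate what compiling the two sources involves, here is a minimal Python sketch of a left join on “r_case_number”. It assumes each table has already been read into a list of dictionaries and that the second table has at most one row per case; the column names “status” and “unit_type” are hypothetical placeholders, not confirmed field names from CATS.

```python
def left_join(primary, secondary, key="r_case_number"):
    """Attach columns from `secondary` to each row of `primary`, matching on `key`.

    Assumes at most one relevant row per key in `secondary`; if there are
    several, only the first is used.
    """
    lookup = {}
    for row in secondary:
        lookup.setdefault(row[key], row)  # keep the first row seen for each case

    merged = []
    for row in primary:
        combined = dict(lookup.get(row[key], {}))  # copy of the match, or empty
        combined.update(row)  # values from `primary` win on any name clash
        merged.append(combined)
    return merged


# Hypothetical rows: a "full" table and an FRF extract carrying an extra field.
full_table = [
    {"r_case_number": "A1", "status": "closed"},
    {"r_case_number": "A2", "status": "closed"},
]
frf_table = [{"r_case_number": "A1", "unit_type": "X"}]

merged_rows = left_join(full_table, frf_table)
```

Cases with no match in the second table simply keep their original columns, so nothing from the full database is dropped. Tables keyed on “election_id” could be joined the same way by passing a different `key`.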
The CATS data–both the full tables and the FRF files–are what the NLRB has posted to data.gov over the years. There are several errors in their postings. The 2000 and 2003 full-file links, for example, are pointed at the incorrect file names. Fortunately their naming convention is standard enough that a URL with the “correct” file name brings up the file for download. The FRF files, meanwhile, are almost all misnamed. The 1999 file does indeed have the 1999 data, but the 2000 file also has the 1999 data! This pattern persists, such that the last file available, 2010, actually has the data for 2009. Thus as best as I’ve been able to determine, the FRF files for 2010 are currently unavailable.
The CATS data were exported from the relational database as XML files. Here again, there is a problem. The files posted online are encoded in UTF-16 rather than UTF-8. This is not the place to explain what that means precisely (though I highly recommend this video for a tutorial), but since there is no other indication that these are in a 16-bit encoding, any text parser that assumes an 8-bit encoding treats these files as having a non-printing null character wedged next to every ordinary character. This completely chokes the parsers.
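To make the symptom concrete, here is a small Python sketch showing what the same text looks like in each encoding, plus a rough check for UTF-16 content. The function names and the 0.3 threshold are illustrative assumptions, not part of any NLRB tooling.

```python
import codecs


def null_byte_share(raw: bytes) -> float:
    """Fraction of the bytes in `raw` that are null (0x00)."""
    return raw.count(0) / len(raw) if raw else 0.0


def looks_like_utf16(raw: bytes, threshold: float = 0.3) -> bool:
    """Rough heuristic: a byte-order mark, or many null bytes near the start."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    return null_byte_share(raw[:1000]) > threshold


sample = "<row>"
as_utf8 = sample.encode("utf-8")       # b'<row>'
as_utf16 = sample.encode("utf-16-le")  # b'<\x00r\x00o\x00w\x00>\x00'
```

For plain ASCII markup, half of the UTF-16 bytes are nulls, which is what a parser expecting UTF-8 trips over.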
It took me a long time to figure out that this was what was going wrong with these files. (It’s a subtle thing, because you can’t see non-printing characters on the screen in most editors!) Fortunately the fix is relatively easy, and then all you have to do is move from XML to some more analyzable format. I chose to do that by passing through CSV to Stata’s DTA format. But I know that R is ascendant, so splitting code into two parts, such that people can just build the CSV and use it from there, seems prudent.
With this in mind, I have made a GitHub repository that has copies of all of the NLRB’s original files, both the full tables and the FRF files, as well as the necessary code to transform these into CSV files and/or Stata’s DTA format.
nlrb-cats repository on GitHub
Documentation about how to use the scripts therein is provided in the README.md on the site.
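For readers who only want to see the shape of the two steps (re-encode, then XML to CSV), here is a minimal Python sketch. It is an illustration under stated assumptions, not the code in the repository: it assumes each file fits in memory, that a file without a byte-order mark is little-endian, and that each record is an element named by `row_tag` whose children are the fields. The real element names in the CATS exports may differ.

```python
import codecs
import csv
import re
import xml.etree.ElementTree as ET


def transcode_utf16_to_utf8(src_path, dst_path):
    """Step 1: rewrite a UTF-16 file as UTF-8."""
    with open(src_path, "rb") as handle:
        raw = handle.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")     # codec reads and drops the BOM
    else:
        text = raw.decode("utf-16-le")  # assumption: little-endian when no BOM
    # If an XML declaration names the old encoding, update it so that
    # parsers do not reject the re-encoded file.
    text = re.sub(r'encoding="utf-16"', 'encoding="utf-8"', text,
                  count=1, flags=re.IGNORECASE)
    with open(dst_path, "wb") as handle:
        handle.write(text.encode("utf-8"))


def xml_rows_to_csv(xml_path, csv_path, row_tag):
    """Step 2: write one CSV line per `row_tag` element; return the row count."""
    rows, columns = [], []
    for record in ET.parse(xml_path).getroot().iter(row_tag):
        row = {child.tag: (child.text or "").strip() for child in record}
        for name in row:
            if name not in columns:  # keep columns in first-seen order
                columns.append(name)
        rows.append(row)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the steps separate means the CSV can be used directly from R, Stata, or anything else, without rerunning the encoding fix.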
From 1984 to 1999, the NLRB used a different database system, called CHIPS (Case Handling Information Processing System, IIRC). The data from that system have never been uploaded, and may be lost to history. However, since the early 1960s, the NLRB had a data-sharing agreement with America’s largest trade-union federation, the AFL-CIO. Every month the NLRB would send the AFL-CIO a data dump with results from various representation and (sometimes) complaint cases, and in return the AFL-CIO would answer questions about said data that people posed to the NLRB. As a result, the AFL-CIO has records of some aspects of representation cases stretching back to the mid-1960s.
During my dissertation, I got to know people in the AFL-CIO’s collective-bargaining department (special thanks to Gordon Pavy, Alfonso Nevárez, and Sheldon Friedman), who shared with me some of the Federation’s historical data on union-organizing drives. One result from that research was a file that has the results of representation elections from 1965 through 1998.
This file is not as complete as the CATS data. In particular, it does not contain information on whether unfair labor practices were involved in the organizing campaign. It also records only election results, and thus has no information on withdrawn election petitions. As a result, these data suffer more from survivor bias than the CATS records. Nonetheless, it is a valuable example of long-term longitudinal data at the establishment level, and I think it should be more widely available.
nlrb_old_rcases repository on GitHub
This too I have in a repository on GitHub. I have included the Stata-formatted DTA file as well as the script that generates the file. I have not included the original, raw text file, because it was exceedingly large. Because it had large numbers of 255-character string fields, many of which were empty, the file was over 2.5 gigabytes of mostly empty space. Reconstructing the data dictionary for this fixed-width file was Borgesian!
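For anyone unfamiliar with fixed-width files, here is a minimal Python sketch of how one record is sliced into fields and how the padding in those 255-character fields gets stripped. The layout below is entirely hypothetical; the actual field names, offsets, and widths come from the data dictionary, which is not reproduced here.

```python
# Hypothetical layout: (field name, start offset, width). Not the real dictionary.
EXAMPLE_LAYOUT = [
    ("case_number", 0, 12),
    ("employer", 12, 255),
    ("city", 267, 255),
]


def parse_fixed_width(line, layout):
    """Slice one fixed-width record into a dict, trimming the space padding."""
    record = {}
    for name, start, width in layout:
        record[name] = line[start:start + width].strip()
    return record


# One padded record of 12 + 255 + 255 = 522 characters, with an empty city field.
example_line = "01-RC-000001" + "Acme".ljust(255) + "".ljust(255)
example_record = parse_fixed_width(example_line, EXAMPLE_LAYOUT)
```

Every record takes up its full width on disk whether or not the fields hold anything, which is how a file like this balloons to gigabytes of mostly blank space.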
Why I built this
As of this writing (summer 2016), I have been working with the NLRB’s data in some form or fashion for more than a dozen years. With each new project, I think to myself, “I should really clean these up right.” Then I get busy with the project itself, and don’t get around to it. Recently, a colleague at another school asked me if I knew where something was in the NLRB data, and while I knew the answer, I didn’t have a simple link or file I could send him. Answering his request (which was a very reasonable one, given the work I’ve done with these data) required rebuilding some of these files virtually from scratch. At that point, I was disgusted with myself, and decided to get this documented once and for all.
So, in the end, I guess that this page is the academic equivalent of paying it forward.