Working papers!
I've been on sabbatical this academic year. One of the side effects of that is I've actually been writing. Thus I should share two "new" papers.
The first is a draft that I posted in November and then forgot to mention: Culture as Entropy Reduction: Finding the 'Organizational' in Organizational Culture. Apologies for the lame name. This is my ongoing effort to develop an empirical approach for examining whether and when it makes sense to talk about an organizational culture as such, rather than a concatenation of national, occupational, or other cultures. It adapts the basic approach to measuring segregation that I've used elsewhere in service of the task.
The second, which I finished drafting today, I'm a little embarrased by. Not because it's bad--though go easy on the discussion, I just needed to finish the damned thing!--but because I've been nominally working on this one forever. I presented a sort of 80-percent draft of this here at ESMT (where I'm visiting) in 2022. When I arrived last month, I still had that 80-percent draft. I will not dwell too much on that though, and will instead bask in the happy feeling of being able to share a draft, at 5.30 on a Friday afternoon no less!
Same Planet, Different Worlds? Shifting Firm Boundaries and Racial Employment Segregation. This presents a theory I've been working on for a while: that the rise of external contracting, combined with the correlation of race and occupation in the United States, has let between-firm segregation rise even as measured within-firm occupational segregation has fallen and the typical workplace has come to seem more diverse. The meat is an adaptation of segregation metrics I've used elsewhere to add a spatial component and to show how the spatially-weighted metrics diverge from the aspatial ones in ways consistent with the theory.
As always, comments are welcome.
Replication never stops
Every so often I get email from people who are working with union-represenation data from the National Labor Relations Board. (No surprise here: I host files and code for the CHIPS and CATS databases.) Upon getting one recent such message, I decided to broaden the conversation, and copied everyone who had written me asking about such data in recent years. I think it has been useful for people to see what resources others have developed.
As part of that conversation, Tom Baz wrote that "I want to raise some more general questions. How do you approach connecting related R Cases and C Cases that happen in the same establishment?" To translate: the NLRB's union-election cases, or representation cases, are the "R Cases," while the charges of unfair labor practices, or complaints, are the "C Cases."
I told Tom that he had triggered my PTSD. In my first-ever sole-authored paper, I linked these two datasets together. The challenge is that the NLRB's fundamental unit in its own database is the case, which has nothing to do with more sensible units like the establishment. The NLRB doesn't record consistent employer IDs, and for many years chose not to release the address information included with these filings. Any researcher trying to link these data would be starting from scratch.
That paper was part of my dissertation. In said dissertation I wrote a detailed appendix about how I matched the records. I did this entirely because, at the time, I was getting pushback from other researchers about how I had identified ULP charges that accompanied specific union election drives. I pulled this appendix out of the thesis and shared a copy of it, which I'll also share here.
I'm not sharing this because I think there is a large untapped demand for it. I share it because, while it's existed since 2009, it exists as an appendix in a dissertation in the MIT Library system. The odds that someone would ever find it by searching the open web are extremely small. This is exactly the sort of work that gets lost in what should theoretically be a shared body of scientific progress.
I'm also sharing it because, dang, if you'd told me that the appendix that I wrote in my dissertation nearly sixteen years ago would ever be helpful to anyone, I would've laughed at you. Yet here I am, exposing how I documented my work. In a perfect world I would just direct the reader to a full replication package, but when I first started doing this project (in the summer of 2004!) those weren't much spoken of. Now I wonder: can I rebuild that project's results from scratch?
Sometimes, I get the impression that people think we only build replication packages because we want to guard against scientific fraud. And this leads to some of my colleagues complaining, stupidly, about the "replication police" or some such. But it has to be said, time and again, that you should take the time to document your work in reproducible form because it helps the research process. There is no better way to ensure that research will be built upon than to make sure that people can see how it was built in the first place.
UPDATE (24 January): Sixteen days after I wrote this post, I got another email asking me about how I matched RC and CA cases in that paper, so many years ago. I was delighted to direct them to this post! Rarely is a good turn so quickly repaid.
New working paper
I am six weeks into the first academic term in four years where I am not teaching any courses. How good that feels! This frees up time to push projects through the pipeline.
With that in mind, a project several years in the making with Victor Bennett, Rem Koning, and my graduate student Masoomeh Kalantari finally has a 1.0 draft. "Demographically Biased Technological Change: Investments in Industrial Automation Disproportionately Benefit White Workers" is now available on OSF. All comments are welcome!
What’s happening with the EEO-1 data? An update
Nearly four years ago I posted an update about the (un)availability of the Equal Employment Opportunity Commission's data. I promised that I would update when I have more information. I finally do. I post here the announcement that went out from Donald Tomaskovic-Devey, from the Sociology department at the University of Amherst, who has long been the point of contact between the EEOC and academic researchers:
After a hiatus of eight years the U.S. Equal Employment Opportunity Commission is making its incredible data resources available again to the research community. These data will be found in Census Research Data Centers and access procedures are detailed below.
These are establishment and firm level data on employment race-gender-occupation composition over time as well as discrimination charges filed with the EEOC and state Fair Employment Practice Agencies. Data are geocoded to the address level and contain company names. At the end of this email you can find some published examples of the use of these data. You will not find any examples of scholars who have used the data on labor hiring halls. Someone should be the first.
I am happy to answer any questions you have about the data that I can, but direct you to David Bowden at the EEOC for getting access and better answers.
With excitement,
Donald Tomaskovic-Devey
University of Massachusetts, Center for Employment Equity
It is also useful for people to have the original announcement from the EEOC, specifically from David Bowden, who is now managing the access team:
The EEOC is in the process of joining the Standard Application Process (SAP) portal hosted at https://www.researchdatagov.org/. We welcome and encourage researchers to submit applications via email to data.access@eeoc.gov. For projects using only EEOC data (along with any user-supplied data), we will be able to fully review and approve email submissions.
Proposals using data from both the EEOC and other federal agencies (see researchdatagov.org for a list of participating agencies and datasets) will eventually need to be submitted to the SAP portal as well so that each relevant agency can review the application. That said, submitting the proposal via email should slightly expedite the timeline for accessing the data as we will be able to review the EEOC-relevant portions of the application. Copying the application to the online portal should be fairly straightforward.
Once an application has been approved, the researcher will be contacted by Census Bureau staff to complete the other steps for obtaining access to FSRDC facilities, including gaining Special Sworn Status. This process may be expedited for researchers who already have Special Sworn Status from a previous project.
Please do not hesitate to contact us with any questions, as well as any requests to be removed from this mailing list. We also encourage you to forward this information to other researchers.
Best,
David Bowden, PhD
Statistician, Data Policy and Access Team
Office of Enterprise Data and Analytics
U.S. Equal Employment Opportunity Commission
(202) 921-2775 | david.bowden@eeoc.gov
This is very good news for the research community. It has been an incredibly long time since we had access to these data. Most publications that have come out using them in recent years have been those that, like mine, are based on data obtained before the hiatus was put into place and that therefore stop their time series in the mid-2010s. All sorts of questions, like what happened to employment segregation during the Trump years, have been left unanswered.
It is deeply bittersweet news for me, though. To work inside the Census Research Data Centers, you have to obtain Special Sworn Status. This is a non-trivial process, and one clear rule is that you have to be resident in the United States for at least three years, with a U.S. employer. Five and a half years ago, I moved to Canada to take a job at McGill. Thus I cannot obtain Special Sworn Status and cannot work on these data directly anymore.
I am enough of a control freak when it comes to research that this is maddening. I realize that all is not lost, though. This is an opportunity for me to forge more relationships with co-authors in the United States who can work with the data. It's at moments like this when I remind myself that Blau and Duncan published The American Occupational Structure without ever working directly with the Census data, instead designing batch jobs that were run at one remove.
I even have a research project that, were we able to get some other data, would use the EEO-3 data on union hiring halls that Don mentions in his email! So if you are an ambitious stratification researcher who is interested in getting access to and working in the Census Data Centers, please keep me in mind?
New working paper
I can't be the only academic who spent the pandemic doing extra administrative work, trying to keep the lights on. A depressing number of my current projects were started around 2019 and are only now reaching fruition. After the paper that Rem Koning, Sampsa Samila and I published in Science came out in the summer of 2021, I entered the longest dry spell of my career: nineteen goddamn months without a paper under review.
Finally, though, that dry spell is broken. Roman Galperin and I have been able to put our working paper on "Occupational licensure, race, and entrepreneurship" up on OSF and SSRN. May the rains continue!
Crosswalks, or Error Propagation
In the summer of 2021, my research assistant Hasnain Shaikh and I made great progress cleaning up the Unfair Labor Practice data that I have mentioned elsewhere. I have written before about the challenges with making sense of geography in these records, because the National Labor Relations Board switched versions of their coding schemes without documenting that. (Also they had a lot of typos on ancient keyboards.) Eventually we reached the point where we could make some simple plots showing the relative intensity of ULP charges in various states, with the states scaled by their union membership.
In those two images above, for example, you can see 1974 and 1984. The upswing of employer intimidation and firings for union activity, commented on at the time, is well under way. The production of the cartograms is a story for another time (we did these with QGIS, but dang we need a python-only toolchain!). Given that these were text strings of punched-card records not too long ago, though, I'll call this progress.
The real challenge in these data though is getting the coding of industry right. All industry-classification systems are compromises, but the challenge inheres in converting between them. Industry change means that these schemes need frequent updating. Said updates can be drastic; the shift from the Standard Industrial Classification (SIC) codes to the North American Industry Classification Scheme (NAICS) codes around the turn of the century famously breaks many longitudinal series. The further back in time you go, the worse this can be.
The data we use run through 1999, which spares us the SIC/NAICS mess, but there are still multiple versions of the SIC to deal with. For the most part, industry classification in the data look like this:
This isn't bad, and many crosswalks exist for moving between SIC classification schemes. I think it best to standardize these records to the 1987 SIC. This will produce the numerator for certain analyses. But what about the denominator?
When looking at geography, at least at the state level, a denominator is easy. Hirsch, Macpherson, and Vroman's work gives us state-level union density annually back to 1964. For industry, we need union density by industry. Here the troubles begin: the Bureau of Labor Statistics is very clear that they have only gathered consistent data on union membership and coverage by industry since 1983. Even where they do have it, they have used yet another scheme: the Census Industry Code lists. This means the data to be made commensurate really look more like this:
What is going on in the 1970s? There are references to data that the BLS gathered on union density between 1973 and 1983. In the data appendix to Farber et al.'s 2021 QJE (gated), for example, the authors write that "The CPS first asks respondents their union status in 1973, and then only in selected months until 1983 from which time information on union status was collected each month in the CPS as part of the outgoing rotation group supplement" (113). Shawn Perron's 2022 Social Currents (gated) has more information: "The [CPS] has asked about employment status, household demographics, and other labor force characteristics, such as union membership and wages since 1973. Extracts are compiled by the National Bureau of Economic Research (NBER). ... These estimates draw on the May extracts of the CPS between 1973 and 1982 and the merged outgoing rotation group (MORG) from 1983 onward" (375). Once you know what you are looking for, it is easy to find the CPS's archive of May extracts. Not for the first time on this project, once I have the right search terms, the data are trivial to grab; the problem is finding those search terms.
Consider what this diagram implies:
- Before 1973, no industry-density information is available
- 1973-1982, we must match 1972 SIC to the May CPS extracts
- 1983-1987, we must match 1972 SIC to 1980 CIC
- 1987-1992, we must match 1987 SIC to 1980 CIC
- 1992-1999, we must match 1987 SIC to 1990 CIC
Again, this isn't too bad. The good news is that, at least in recent years, CIC codes appear derived from NAICS codes. This gives me hope that the same will hold for the era covered by SIC codes.
Thus we could eventually do with industry what we have done with geography: plot counts of ULP charges across industry, scaled by industry density. (Whereas geography lends itself to cartograms, industry probably needs something like a mosaic plot.) But before I do so, it's worth discussing error propagation.
Error propagation in crosswalks
The social sciences (apart from some parts of psychology) don't spend much formal time on error propagation, or the propagation of uncertainty. I think that this is because direct measurement is so rare. We usually get our data from somewhere else. While we learn about measurement error, and understand that classical measurement error can introduce noise (but not necessarily bias) into our analyses, we spend very little time trying to quantify it. We also place too much emphasis on statistical significance per se, and downplay getting parameter estimates right. The situation is different in other sciences, where hypotheses are often about specific parameter values and where experimentalists often generate the data they analyze.
Assume we want to know how variable \( y \) responds to changes in variable \( x \) -- that old chestnut, correlation. Pearson's correlation coefficient \( \rho_{xy} \) is just \( \frac{cov(x,y)}{\sigma_x \sigma_y} \), that is, how much \( x \) and \( y \) vary together as a share of how much they vary individually. (Ever notice how much \( \rho \) looks like a Jaccard index?) Assume that we have classical measurement error; we measure \( x \) with a lot of noise. A noisy measure is going to have more internal variance, which increases \( \sigma_x \). Noise will also increase the variance of \( x \) relative to any given variation in \( y \), which by construction will decrease \( cov(x,y) \). Noise thus shrinks the numerator of \( \rho_{xy} \) while inflating its denominator. This is why classical measurement error biases coefficients toward zero.
So far, so good--this is basic stats. It's also why experimental scientists obsess over accurate measurement at every stage: measurement error in any stage of a variable's construction will propagate. Social scientists know that measurement error happens, but because we usually inherit our data from some other entity that did the primary measurement, we limit ourselves to discussing how it could affect our results. Most often, we assume that measurement error will bias our results toward zero, and therefore treat any results as conservative estimates of effects.
There is a place though where social scientists generate a lot of data even though they make no primary measurements: crosswalks. This is the generic term for our heroic and often flailing attempts to make incompatible datasets compatible. One of my heroes in this area is David Dorn, whose data page, replete with crosswalks, has bailed out countless researchers. (Seriously, people like David should set up Patreon accounts. Though perhaps citations do the same, in the currency of academic reputation?)
I want to distinguish two types of crosswalks. One is a straightforward recoding. You might have US states recorded as full names, two-digit abbreviations, or FIPS codes. This needs fixing, and there are whole packages for it, but there is no uncertainty in this process. I don't care about these. The second type is when we make guesses (educated guesses, but guesses nonetheless) about the code we should assign an observation in scheme 2, given the code it was assigned in scheme 1. There is ample room for measurement error here.
(Aside: crosswalks are just joins, to use the SQL term. There are a ton of different types of joins, though. The simple-recoding type of crosswalk boils down to a one-to-one join between two tables of codes. At the other end, many-to-many joins are often indeterminate. Because we use "crosswalk" to cover all of these, I think that the term can hide more than it reveals.)
I think the order in which you should build a complicated crosswalk is interesting. Say that you have data that's been encoded using three schemes, S70, S80, and S90, and that you want to standardize everything to S90. You could harmonize the S70 to S90 and the S80 data to S90. Alternatively, you could harmonize S70 to S80, then harmonize S70/S80 to S90. These aren't the same! Is there a formal way to choose between them?
I think that the answer lies in the data-generating process of the schemes themselves. If industry change is random then we can think about said change as retroactively applied mistakes in coding individual observations. This is a stochastic process, like Brownian motion. If I learned anything from Scott Ganz's work, it's that fixing a data point partway through a Brownian process reduces the variance of our estimates later on in the process. In these cases, it would make sense to harmonize S70 to S80, then harmonize S70/S80 to S90.
There are problems with that line of thinking, though. Even if the assume that the enumeration of possible industries in each scheme is appropriate (and we have little choice but to assume that), we don't necessarily know where to assign a firm in scheme 2, given solely its assignment in scheme 1. The best such crosswalks have probability weights that can be used to build transition matrices, but that means you're introducing assignment error at each stage. This is when we cycle back to error propagation. When are sequential probabilistic assignments with small(er) errors preferable to a single probabilistic assignment with large(r) errors?
This is a post that almost certainly will have an update. But, gosh, I'm curious whether other people have worked on this point.
Keep calm and ODBC
This was originally written on 14 April 2021.
This post has many parents. I want to update where I am on the project that started this blog: studying the diffusion of unfair labor practices in union settings. I also recently solved a problem that took far longer than it should have. I want to document what I did, to remind myself and maybe to help others. And I want to get back to blogging after a truly hellacious year, for me and for everyone. If a rant about database connectivity kickstarts things, so be it!
Changing languages
In late 2018 I discovered a trove of data on unfair labor practices dating back to the early 1960s. I had sought these data a decade earlier, without success. Now, here they were: copious megabytes of eighty-character punched-card records that needed deciphering and translating into a modern format. I saw this as a means for me to keep learning R.
In 2019, though, I used Python for more projects. Changes in my research also affected my choice of programming languages. Inequality and segregation research can privilege accounting statistics over inferential statistics. This makes the data-management capabilities of general-purpose languages relatively more valuable, and the installed base of routines in focused statistical languages relatively less. I work with masters-level research assistants at McGill, more than I do with PhD students, at least on my own projects; and masters students are more interested in learning Python than R. Thus last summer I chose to clean these data with Python. My existing R code base was small, thus switching costs were low.
The Pandas library expedites data analysis in Python. Something that kept me devoted to Stata for so long, though, is its felicity at cleaning archival data--finding typos, fixing data types, and the like. I've not yet found a Python library to speed up this process, though of course Python's strength lies in the ease of defining simple helper functions. (Such functions probably merit a post of their own.)
Where do you put the clean stuff?
Python it is, then. Coming from languages like Stata or R, though, it is striking how Pandas does not have a specific data-file type.
This is good and bad. I've struggled with proprietary data formats. In particular, I think SAS files should come with a trigger warning! CSV is so very slow, though, partly because most programs have to guess variable types while reading in data. Categorical data types can be extra problematic when you write data to plain-text files. If categorical value labels are long, file sizes can explode. You can write label dictionaries, but you must then manage multiple files. The same point holds for data dictionaries.
The best compromise is a relational database. The SQL database structure is non-proprietary and widely supported. Most statistical languages have built-in facilities for I/O with such databases. I've actually felt guilty for years that I don't store my most often-used datasets in such a format. Since Python doesn't give me an easy binary format by default, this seemed the right project to start with.
Writing a Pandas dataframe to SQL is straightforward. I find
SQLite much easier to use than, say, MySQL, since accessing files with SQLite does not require your mimicking a client-server interface on your local machine. Python has the sqlite3 library as part of the base installation. You just need to create a "connection" to the file through which you pipe SQL code:
import sqlite3 as sql
def CreateConnection(db_file):
    conn = None
    try: 
        conn = sql.connect(db_file)
        return conn
    except sql.Error as e:
        print(e)
    return conn
def CreateTable(conn, create_table_sql):
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except sql.Error as e:
        print(e)Imagine you want to put four columns of a dataframe called "cases"
into a table called "ulp": _id, caseNumber, dateFiled, and
state. Create a connection to the relevant database file. Define the
SQL to lay out the table and pass it to CreateTable(), which puts
the table into the database file's schema. Then use the Pandas
to_sql() method to pass data from the dataframe into the database:
conn = CreateConnection('ULPs.db')
sql_create_table = '''CREATE TABLE IF NOT EXISTS ulp (
    _id INTEGER PRIMARY KEY,
    caseNumber TEXT,
    dateFiled TIMESTAMP,
    state INTEGER
    );'''
CreateTable(conn, sql_create_table)
cases[['caseNumber', 'dateFiled', 'state']].to_sql('ulp', conn,
       if_exists='append', index_label='_id')Here I have "state" stored as an integer. Really this is a categorical
variable, and I would like to keep the value labels associated with
it. I have a dictionary called state_map with those labels. Writing
this dictionary to its own table is simple enough:
sql_create_table = '''CREATE TABLE IF NOT EXISTS state (
    state_id INTEGER PRIMARY KEY,
    state TEXT
    );'''
CreateTable(conn, sql_create_table)
for key in state_map:
    sequel = '''INSERT INTO state(state_id, state) VALUES(?,?)'''
    c.execute(sequel, (key, state_map[key]))
    conn.commit()Notice here that sequel is a SQL command that itself takes arguments
(?,?). Thus c.execute() sends the command and a key-value pair
from state_map to the database. In other words, the variable
expansion here happens on the SQL side, not the Python side. Simple
enough, though passing variables defined in one language to commands
in another is tricky the first time!
Datasets that have many categorical variables tend to have many sets
of value lables. It would be better to define a loop to build all such
tables. I found this head-scratching enough that a solution seems
worth sharing. For each categorical variable foo I create a
dictionary, foo_map, with value labels. I then create a dictionary
of dictionaries:
table_maps = {'foo': foo_map, 'bar': bar_map, 'baz': baz_map, ...}I then put the preceding code into a loop that iterates over the key-value pairs in this dictionary. I must now include Python variables in the SQL code along with the SQL variables, and reference nested dictionaries:
for key in table_maps:
    sql_create_table = '''CREATE TABLE IF NOT EXISTS %s (
    %s_id INTEGER PRIMARY KEY,
    %s TEXT
    );''' % (key, key, key)
    CreateTable(conn, sql_create_table)
    for k in table_maps[key]:
        sequel = '''INSERT INTO %s(%s_id, %s) VALUES(?,?)''' % (key,
            key, key)
        c.execute(sequel, (k, table_maps[key][k]))
        conn.commit()This code rewards working through the variable expansions. And yes, it
 makes sense to build this code into a function, and then call the
 function for each item in table_maps.
At this point, somewhere, you have a SQLite database. Huzzah!
Fine, but how do you read it?
Now we get to the part that has driven me nuts multiple times: how do you read data from such a file?
In Python this is trivial: open a connection as above, define a SQL query, and pipe the results into a Pandas dataframe:
# Assume conn already exists
df = pd.read_sql_query("SELECT * FROM ulp", conn)In that case, you'd have the data that we wrote to the table ulp
previously. In R, presuming that you have installed the RSQLite
library, it's virtually the same:
conn <- dbConnect(RSQLite::SQLite(), "ULPs.db")
data <- dbSendQuery(conn, "SELECT * FROM ulp")And in Stata it's...oh, God damn it.
Stata lacks this type of SQL integration. Instead it uses open database connectivity or ODBC, "a standard API for accessing database management systems." Let me be clear that ODBC is a good thing. Out in the real world, having an interoperability layer between databases and applications is extremely important, and drivers are widely available. For an individual researcher, though, getting ODBC working just to read a SQLite file can feel like Carl Sagan's "To make an apple pie, first create the universe" line.
Stata has a series of odbc commands, but (per the nature of ODBC)
these just talk to the operating system's ODBC device manager. This
is where the fun starts. These days, OS X does not come with an ODBC
device manager; they dropped iodbc with Snow Leopard, because Apple.
There are two main ODBC device managers, unixodbc and iodbc. I'll
use the former, which you can install with Homebrew. (Don't have
Homebrew installed? Welcome to yak
shaving!
Thus in Stata I can set odbcmgr unixodbc (becuase Stata looks for
the erstwhile iodbc on Mac) and then use the ODBC device manager to
talk to the database. But the device manager does not automatically
how to do this. You must specify an ODBC device driver, which,
inevitably, is not installed. Things get squirrelly around device
drivers. You can find DevArt charging US$169 for a desktop license for
a SQLite ODBC driver; but you can also install sqliteodbc, again
from Homebrew, for free. YMMV, but for research projects, I think
you'll find the free version sufficient.
With these installed, you're almost there. However, unixodbc is very
Unix-y, in that it expects to find information in various
initialization files, notably odbcinst.ini and odbc.ini. These can
live in several places (type odbcinst -j in terminal to see where it
looks by default), but--in a grudging concession to usability--it will
preferentially use copies in your home directory. Mine look like this:
.odbcinst.ini:
[SQLite Driver]
Driver = /usr/local/Cellar/sqliteodbc/0.9998/lib/libsqlite3odbc.dylib
.odbc.ini:
[ULPs]
Driver = SQLite Driver
Database = /Users/jpferg/Dropbox/research/ulp_diffuse/data/ulps.dbOnce you have all of the things in the right places, then you can pull
the relevant data with odbc load, table("ulp"). Short, simple
command, but there's a pile of infrastructure behind it!
The curse of thin documentation
I swear, I didn't write all this just to complain. I've tried to figure out ODBC in Stata for years. Each time, I bounced off the complication of lining these pieces up. Multiple device managers, multiple device drivers, an implied client-server model on the local machine...I can perhaps be forgiven for throwing up my hands and sticking with .dta files.
Here's the thing though: when you look at what I wrote to read SQL database records into Stata, it doesn't look complex:
- Install unixodbcandsqliteodbcwith Homebrew
- Write ~/.odbcinst.ini. List where the device driver lives and give it an easy-to-use name.
- Write ~/.odbc.ini. List where the database file lives, the driver it needs (using that easy name), and an easy-to-remember name for the file.
Now you can type odbc list, odbc query, and other commands inside
Stata. These show databases you can access, which tables are in each,
the tables' structure and the like. odbc also lets you do database
I/O with SQL commands. Adding databases or drivers is just a matter of
adding lines to those files. The Python code I excerpted in this post
is way more complicated than the process I describe here. So why am
I complaining about this?
Though simpler, this process was much harder to learn. This is the
cause and the effect of Stata's fading prevalance. Stata is more than
30 years old; many of its user conventions are almost
pre-Internet. For eons the canonical place to trade information was
Statalist, an email list. The hardcore user community never migrated
to Stackexchange or other successors. Because new people use different
software, the posts that do exist for Stata/ODBC connectivity are
increasingly ancient. This doesn't cause problems for Stata itself
but, because odbc is separate software, changes that affect its
availability sow confusion. I mentioned for example that OS X dropped
iodbc from the base install with Snow Leopard. I learned that from
posts on Statalist about people trying to configure odbc on Mac. But
Snow Leopard was released twelve years ago!
Maybe Stata will include SQL support in version 17. They could bundle
their own device drivers. But why?  Almost anyone who uses Stata with
SQL databases already knows how odbc works, and new people are
unlikely to install Stata.
There is plenty of information about ODBC and SQL online, of course. But, crucially, there is little about using those them with Stata. That is where things break down. I don't want to learn a ton about ODBC--I just want to read a database! Yet Stata's own documentation stops at the modularity frontier.
As of this writing, nine of the first ten hits from a Google search of
"reading SQL with R" are third-party pages walking through ways to do
this. The oldest is 2015; most were posted within the last three
years. By contrast, three of the first ten hits for "reading SQL with
Stata" are from stata.com, and direct toward the odbc
documentation. Five of the remaining seven are from between 2014 and
2017, and virtually all of them are rants about how Stata does not
support SQL! One of the two remining is a query (from 2017) on
Statalist asking about reading SQL. Tellingly, the person on the list
who recommends odbc follows up with "I haven't used the -odbc-
command in many years--it just doesn't come up in my work. So all I
can tell you is that this is what -odbc- does. But I no longer
remember any of the details of how it is used."
The pernicious problem of the perfect paper
This was originally written on 7 July 2020.
Insisting that studies should not have errors or weaknesses handicaps scientific progress in organizational research. Let me explain.
Within this field, it seems bad form to point out the errors or limitations in prior studies. This is especially so when it comes to problems with the operationalization of the theory or weaknesses in the study design. You find rare instances of papers' pointing out a previous paper's mistakes but rarer still find papers' citing previous paper's empirical limitations. Yet this second kind of truth-telling is more important, because empirical limitations are more common than out-and-out mistakes.
In practical terms, you rarely read a paper by Carlson that says, "I am trying to do what Alicedottir & Bobson did but with a better research design." Saying this implies that Alicedottir & Bobson's paper wasn't perfect. If they had discussed their paper's limitations in their paper then it would be uncontroversial for Carlson to say that they were addressing those limitations. Indeed this would look like a contribution--Carlson figured out how to do something that had bedeviled earlier researchers! But because Alicedottir & Bobson didn't discuss their study's limitations, Carlson has to cannot just say that his paper addresses them. He has to first call attention to the limitation, along the way implying that Alicedottir & Bobson were either too thick to notice it or too mercenary to acknowledge it.
For Carlson, what should be a banal advancement of the science requires calling someone else out on their scientific practice. That's a very different rhetorical move. It feels all the riskier the higher status Alicedottir and/or Bobson are, and the more accepted their study is. Normal-science progress gets tangled up with critiques of scientists. Junior academics are too cautious about calling out their seniors, but you can understand their reticence.
If said call-outs are so rare, what does "progress" look like? I think this problem helps drive the "no one has looked at x" motivations of so many papers. This is a cardinal rather than an ordinal critique of the literature. It lets the author make a contribution without having to say anything bad about what came before. Carlson can write, "There's all this previous work out there, such as Alicedottir & Bobson, and it's of course amazing, but they didn't look at this other thing. Not that there's anything wrong with their not having looked at it. There's an infinite number of things to look at, after all."
I can imagine a counter-argument, namely that people actually frame their papers this way because of the field's unhealthy stress on theoretical innovation rather than on theoretical testing or strengthening. Perhaps--but whence that stress? There's a chicken-and-egg debate there that doesn't much interest me. Even if the habitual emphasis on innovation explains the general tendency, I don't think it explains why people hesitate in individual cases to criticize how prior theories were tested.
The Frontier of Knowledge
This approach ramifies through and taints other sections of the research paper. To see how, I need to reference a hoary metaphor: the frontier of knowledge.
I think of scientific progress like a line on a map, moving across and revealing once-unknown territories. (It'd be better to call it a high-dimensional manifold in a yet-higher-dimensional space, but come on.) Studies are like individual expeditions, advancing beyond that line (and thus redrawing it) along part of its length. In aggregate, that line advances. Advance comes in fits and starts, and weird salients can develop. Think of the state of physics versus the state of medicine in 1800! Any given advance fills in part of the map, but vast unexplored territories loom beyond the frontier. Contrast the first gringo to see the Grand Canyon to the first gringo to navigate its length. (And take my use of "gringo" as shorthand for all of the issues pertaining to the social structure of knowledge I'm eliding here.) It's a big deal to report back to people, "My dudes, there's a GIANT FREAKING CANYON out there!" That advances the frontier. And it poses the obvious follow-up: "Someone needs to explore that canyon!"
I like this metaphor because it underlines how all studies have limitations, and why it's fine to acknowledge them. "We have zero idea where this canyon ends. We weren't expecting it, and we didn't have enough supplies to explore it further." "We were actually studying migrating birds, and stumbled on this canyon while following them. We're not competent to do geology." And so on. The key point here is, do we publish this explorer's report? The answer to that question does not hinge on these limitations; it hinges on the importance of the contribution relative to these limitations. If we think the discovery of that canyon is a big deal, we're happy to publish it, even if the explorer can't tell us how the canyon was formed or where it leads. A major reason to publish the discovery is to call attention to this spot where they've advanced the frontier and rally people to explore the land beyond. But we rally people by calling attention to the unexplored frontier. "Who can say where this canyon leads, or what lies within?"
This is what a research paper is supposed to do:
- Define the frontier of knowledge. What have people discovered thus far? Notice that you're defining roughly the frontier where you are advancing, not the entire frontier.
- Describe your advance. Where did you go, and what did you find?
- Describe the territory ahead. Where couldn't you go, and why? Where do you think it would be best to explore next?
This metaphor reinforces the idea that science is, or is supposed to be, a collective, cumulative endeavor. We celebrate the advances and we implicitly recognize that each advance is partial. That partiality is not a mistake; it was the best people could do, and doing more is what keeps us going.
When papers have to be perfect, this all breaks down. The literature review of a "no one has looked at _x_" paper flails when trying to define the frontier. The author wants simultaneously to anchor their idea in an existing literature and to not bad-mouth that literature. Thus they cannot point to a few studies (i.e., a point on the map) and say, "Here. I am setting off from here." Instead they have to sift through a ton of related ideas and explain why their idea is slightly different. Often--and this is where the high dimensionality of the actual frontier becomes important--they have to explain why the work of an entirely separate population of explorers also didn't discover their idea. It would seem that this exercise would uncover a small niche, but the poor author also has to explain why this niche, hemmed in by so many similar and more famous discoveries, is also a big deal. It's a thankless task, and it helps to explain why so many literature reviews feel like the author fronting that they have Read the Right Things. Graduate students can be forgiven for scratching their heads about the point of literature reviews.
This is a shame, because literature reviews aren't pointless. (If you'd told me, seventeen years ago, that I'd write that sentence...) A good literature review tells the reader where you're setting out from. A good literature review explains where the frontier is, and why people think it's important to advance it. And a good literature review is short. I can't be the only person who found literature reviews the hardest part of any paper to write, because it felt like I was trying to distill all of my graduate education into six or seven pages. But that was never the right tack. I just needed to tell people what I was riffing off of, and why.
Thus literature reviews; but I actually think the problem of the perfect paper is most evident in the discussion section. Here is where you may balk: why am I complaining about how are papers do not acknowledge their limitations? We're all but required to discuss limitations! But the dirty, open secret is that most papers' limitations sections are total bullshit.
I use "bullshit" here in the sense that the philosopher Harry G. Frankfurt defined: an utterance that is unconnected to a concern with the truth. As an example, Frankfurt recounts Wittgenstein talking to a friend, Fania Pascal, who has had her tonsils removed. Pascal says she feels like a dog that's been run over, to Wittgenstein says, "You don't know what a dog that has been run over feels like." Assume Wittgenstein isn't just a coot; assume he thinks Pascal is bullshitting him.
[Why] does it strike him that way? It does so, I believe, because he perceives what Pascal says as being...unconnected to a concern with the truth. Her statement is not germane to the enterprise of describing reality. She does not even think she knows, except in the vaguest way, how a run-over dog feels. Her description of her own feeling is, accordingly, something that she is merely making up....
It is just this lack of connection to a concern with truth--this indifference to how things really are--that I regard as the essence of bullshit.
I posit that the vast majority of limitations that authors mention in published studies are this type of bullshit. They do not care whether and how these stated limitations affect their results; they care whether the reader (more specifically, the reviewer) is persuaded that they have devoted sufficient attention to limitations, as the norms of the field obligate them to.
How many times have you read a paper where the authors list as a limitation that they only looked at one organization, or one country? This is a bit like saying you only did one research project. Literally any study could be said to have this limitation, so listing it smacks of ceremonial compliance. Or how about the limitation that they haven't demonstrated the "mechanisms" that cause the effect? If the contribution is sufficient then that isn't a limitation; that's grounds for future research. Citing that as a limitation is the research equivalent of saying you're too nice or you work too hard in a job interview when you're asked about your weaknesses.
If limitations are often bullshit, the "future research" section is usually bullshit. To see what I mean, select ten articles, at least ten years old, by reasonably productive authors. Note the future research projects they propose in those articles. Search their CVs for those projects. I wager you'll largely search in vain. All future research needn't be done by the researcher who proposes it, of course. It is still striking how rarely researchers seem to care whether research ideas they propose are borne out. It is as though they are indifferent to whether the ideas they propose are really true. It is as though they are bullshitting us.
I don't think this is malicious bullshit. I've spewed bullshit myself, mostly in early papers. I didn't know what else to do! I was supposed to write the paper as though my study had no flaws or limitations, so obviously I couldn't suggest the most obvious types of future research: redo the study with better data, formally justify a way to measure something instead of relying on an ad hoc technique, and so on. Instead I was encouraged to think about applications of the idea or "approach" I'd developed in the paper. By definition, these could no more advance the frontiers of knowledge than the paper itself had done.
It's actually quite striking that theoretical approaches in organizational research aren't overturned so much as they're exhausted. Neoinstitutionalism, organizational ecology, much ado about status...they seem eventually to collapse under the weight of their encrusted scope conditions, boundary cases, elaborations, extensions, and theoretical curlicues. Yet along the way, the theories' proponents rarely addressed limitations that the theories had at the start. Once, after reading a mountain of organizational ecology, a graduate student asked me, "How would you measure structural inertia? Like, is there a way to rank organizations by how inertial they are?" It seems like a blindingly obvious thing to work on, but nearly no one has.
(I'm picking on org ecology here because I think its adherents do a better job of testing, critiquing, and building on its theories than most others. In grad school I heard a neoinstitutionalist sassing org ecology by noting how "all their studies ask the same question and get the same result." Sixteen years later I think, isn't that what we call replication?)
The key point I want to make here is that if you don't describe your advancing of the frontier of knowledge--the second of the three things a research paper should do--then the first and third things become, if not impossible, then ceremonial. The first time I understood one of my papers as trying to correct and build on a specific set of prior studies, I found that the lit review and discussion sections just...fell into place.
Where do we go from here?
This rant would be bitter and pointless without some recommendations. As it happens, I have several. Rather, I have one, but it applies to two different groups.
First, as researchers, we need to be honest about the limitations of our studies. We need to state what we can know and what we can't. When we point these things out, people can see how far they should trust our findings and where they could work on them. The ground truth is that all studies have limitations, just as all exploration is finite. Explaining where we did not go is useful in its own light, because it focuses our successors' attention on where they should explore next. Anyone who thinks that building on prior work devalues the original needs to go back to science school. When Newton wrote that "If I have seen farther, it is because I am standing on the shoulders of giants," no one said, "That Galileo--what an idiot!"
Second, as reviewers, we need to evaluate manuscripts on the magnitude of their contributions, and assess the relative value of their contributions and limitations. This just amounts to not punishing authors for their honesty. I find myself writing a lot of reviews of this type lately. I often point out that there is an empirical weakness of the paper, but not one that I think should block publication. Rather, I think the authors should be explicit about the limitation, both to qualify how we treat the contribution and to suggest how we could improve on it in the future.
Perhaps you think this would open the floodgates to mediocre or flawed research. I disagree. Right now, it seems that a lot of articles are evaluated on whether they contain flaws. If the reviewers do not see too many (And remember that authors have incentives to couch their studies so as to minimize the apparent limitations!), then accepts often follow. In such a regime, my suggestion opens the floodgates. But I am also calling for us to identify the contribution, and weigh these things. I have read many papers that make no real contribution, which I would thus have rejected despite their lack of glaring flaws. Too many of the "no one has looked at x" papers fit this bill.
I don't want to see more mediocre papers; I want to see fewer perfect papers. Because real science isn't perfect.
What’s happening with the EEO-1 data?
This was originally written on 13 May 2020.
I get asked this question enough that it seemed worth putting the answer in one place.
I work with the Equal Employment Opportunity Commission's EEO-1 establishment-survey data, which surveys large private employers about the racial and sexual composition of their workforces. Historically, researchers gained access to these data via the Intergovernmental Personnel Agreement. This is the law that governs secondment between federal agencies and other employers. Say that you work for the California Parks Service and are going to spend a year in Washington at the Department of the Interior. California is still paying your salary but you get treated as a federal employee for other purposes; this is the type of secondment the IPA handles. In our case, researchers were treated as Commission employees at a salary of $0 per year. We could access the data, and we could be punished to the full extent of the law if we were responsible for a breach.
As the federal government has tightened up its data-security policies, this access method moved out of compliance with the standards. In February 2018 the EEOC's Chief Data Officer announced a moratorium on new IPA agreements while the Commission put a compliant data-access regime in place. Those of us with existing IPAs have been allowed to continue working on the data through extensions.
In December 2018 the EEOC announced a partnership with the National Opinion Research Center at the University of Chicago to build a secure data enclave, through which researchers will be able to work on the Commission's data. The original plan was to bring on some existing researchers as beta testers of the new system in August 2019 and start accepting new applications for access in January 2020.
The beta version of the enclave went into testing in the fall of 2019. I was one of the beta testers. The new system is and will be a considerable improvement over how we accessed the data in the past. There were then delays, though, over beginning the access-request process. The Commission's CDO gave an update to the research community in early March:
In the course of putting the virtual data center together we learned that EEOC does not have any legal authority to share data with the “public” and according to agency lawyers, for Title VII purposes, researchers are considered the public. The IPA process had apparently been used as a workaround.
In a stroke of luck the recently enacted Evidence Act provides agencies with a way to share data with researchers. Under this law, approved researcher are deemed “agents” enabling them to use the data while imposing felony sanctions (jail time and large fines) for deliberate misuse of the data. These are stronger sanctions than the Title VII protections.
However the agency must be a statistical agency or be a designated statistical unit within an agency in order to avail themselves of this provision of the law.
We have decided - with the full support of the Chair - to seek official OMB recognition as a statistical unit for OEDA’s Information and Data Access Division. Just last week we received the authorization from the Office of the Chair to move forward with the application process.
Currently we have begun to develop the application and expect to submit it to OMB in May. Our hope is to receive approval by late summer. We’re hoping to have everything up and ready for researchers in late Fall.
Once and for all we will have in place the actual legal authority to share data, the data sharing process will be governed by the same CIPSEA provisions governing other federal statistical agencies, the application for access to a secure virtual research data center will be open to any qualified researcher, and appropriate strong disclosure controls will be in place.
Please thank all of the researchers for their patience. A superior long term solution for all will be in place once we’re finished.
Of course, within hours of this email, the shit hit the fan with respect to the pandemic.
So, that's what's happening with the EEO-1 data. At present, the EEOC is not accepting applications for access. They probably will be before the end of the calendar year. How much of a wrench COVID-19 has thrown into the works I cannot say, but everyone involved has been acting in good faith.
I write all of this here because, every few weeks, I get an email from an academic colleague asking what the deal is with accessing these data. This is reasonable--I'm happy to be one of the "authorities" on these data! But rather than typing the same thing over and over, I thought I'd just write it once and direct people to it.
Obviously, the second I know more, I'll write a new post!
Refactoring a project
This was originally written on 29 April 2020.
What I did this spring
I'm new enough at blogging that I need to give an explanation if I drop of for a few months. In January, I became the academic director of McGill's MBA Program. That's been busy enough, but of course seven weeks ago everything went sideways with COVID-19. I serve on the executive of the faculty's emergency operations committee, with all that entails. The remainder of March and most of April were a lost cause for research. In the last two weeks I've been able to scrape together a few minutes here, an hour there, to nudge projects forward again. I've learned that I can do things like copy-editing and file management while listening in on meetings that I'm not actively leading.
Like everyone else, I want to see this pandemic in the rearview mirror. But my job is secure, I can do most of my work from home, and I have no small children that I have to keep entertained. I'm lonely and stir-crazy, and I know I'm one of the very fortunate ones. Thus I won't use this space to whine about lockdown. Surely we're all bored of talking about it, right?
The trials of organization
Instead, let's talk about replicable research. I think that it's important for other people to be able to reproduce your results, but the main reason I care about replicability is so I can reproduce my own results. I've found that I need to stop at certain points in a project and clean up, not just the code in files, but also the file structure itself. My rule is that the first time I have trouble remembering which file I'm supposed to be working on, it's time to clean up the project directory. Trust me: if you wait until the "end" of the project, it will be too late.
I use a standard directory structure for my research projects:
project_name
   |--canonical
   |--data
   |--documentation
   |--drafts
   |--figures
   |--literature
   |--presentations
   |--scripts
      |--old_code
   |--tablesA lot of people use something like this. /Canonical holds original data and ideally is read-only except when explicitly adding new files. This is the stuff that, if I lost it, I'd be back to square one. /Data holds any data that I create. I think the distinction is important. I never write output to /canonical because it's just too easy to overwrite or erase things. If I've done things correctly, /data could be erased and I could be up and running again in the time it takes to run my code. Documentation has how-tos of various sorts. These might be software manuals, dataset codebooks, methodological pieces, and the like. These are separate from /literature, which has phenomenological pieces related to the project. Typically, if it will wind up in the lit review, it goes in /literature; if it will wind up in a footnote or appendix, it goes in /documentation. The distinction between these two directories is the least important.
I keep all of my code in a directory called /scripts. That name probably says a lot about me. First, I started doing research before it became chic to talk about "code" and "coding" all the damned time. Second, I've spent too long in programs like Stata, and I maintain a mental distinction between "scripts," where I invoke existing commands, and "programming," where I write my own commands to give me functions beyond those built in. I realize this is a hoary and obsolete distinction. But let me have my old-coot-dom.
/Drafts is where the paper drafts go. /Figures and /tables hold the obvious things. Note that TiKZ drawings live in /figures both as TeX and as PDF documents. Finally, /presentations has the inevitable PowerPoints. I maintain a separate directory for presentations because I often have ancillary graphical elements and perhaps custom-built figures for presentations, and it's easier to keep those in one place. LaTeX is fine with crawling across a directory structure for its needed elements, and there are traces in the files themselves of what resources are needed. Because PowerPoint embeds things, it's hard to reverse-engineer where the elements in a presentation came from.
Other people use other structures, but almost all productive people have some structure. The advantage of using the same skeleton across all projects is that you don't have to spend mental energy remembering where you put things on this one.
This skeleton also implies a rough workflow. Write scripts that read from /canonical and write to /data. Write other scripts that read from /data and write to /figures and /tables. Reference those outputs in a document that lives in /drafts. Store the scripts in /scripts. Rinse; repeat.
In practice, it can be hard to follow this. Even if you save files in the right places, you can go down blind alleys and have to restart. But what do you do with that old code? Maybe you write a temporary file at some point because the code has grown too complex, or because some programming step takes a long time to run. Do you keep that file in the workflow, or do you refactor things to make it unnecessary? This is where the next step comes in.
Projects as directed graphs
When I can no longer easily remember how the files of one of my projects fit together, I map dependencies. If I already have a draft paper, that's a good place to start. What files does it call? What files generate those files? What files generate those files? And so on, back to the canonical data. I mark up these connections freehand, in a notebook or in a text editor. Then--and this is probably where I reveal myself to be an obsessive--I write it up formally in the dot language, as a directed graph. I do this because I can then visualize the dependencies, which helps me immensely.
One of my running projects is on spatial metrics of employment segregation. I hit the wall last night, so today the directed graph came out. That produced the following code:
digraph S {
        "S/spatial_segregation_scripts.py" [shape=box];
        "S/spatial_v_aspatial_theils.py" [shape=box];
        "S/pyloess.py" [shape=box];
        "D/geocoded_YEAR_short.csv" [shape=box,color=blue];
        "D/spatial_figs.csv" [shape=box,color=blue];
        "F/spatial_win_btw2.pdf" [shape=box,color=green];
        "F/overstate.pdf" [shape=box,color=green];
        "S/overstatement_figure.do" [shape=box];
        "E/eeoc_addresses_census_school.csv" [shape=box,color=red];
        "S/build_geocoded_years.do" [shape=box];
        "E/EEO1_1971_2014.dta" [shape=box,color=red];
        "D/eeo_YEAR_short.csv" [shape=box,color=blue];
        "D/geocoded_YEAR.dta" [shape=box,color=blue];
        "E/MATCHING_XWALK_EEO1_1971_2014.dta" [shape=box,color=red];
        "F/overstatement.pdf" [shape=box,color=green];
        "D/matchrates.dta" [shape=box,color=red];
        "S/matchrates.do" [shape=box];
        "F/matchrates.pdf" [shape=box,color=green];
        "F/kdensities.pdf" [shape=box,color=green];
        "D/spatial_figs.csv" [shape=box,color=blue];
        "S/spatial_segregation_scripts.py" -> "S/spatial_v_aspatial_theils.py";
        "S/pyloess.py" -> "S/spatial_v_aspatial_theils.py";
        "D/geocoded_YEAR_short.csv" -> "S/spatial_v_aspatial_theils.py";
        "S/spatial_v_aspatial_theils.py" -> {"D/spatial_figs.csv";
                "F/spatial_win_btw2.pdf"; "F/overstate.pdf";
                "S/overstatement_figure.do"}
        "E/eeoc_addresses_census_school.csv" -> "S/build_geocoded_years.do";
        "E/EEO1_1971_2014.dta" -> "S/build_geocoded_years.do";
        "S/build_geocoded_years.do" -> {"D/eeo_YEAR_short.csv";
                "D/geocoded_YEAR.dta"; "D/geocoded_YEAR_short.csv"}
        "E/MATCHING_XWALK_EEO1_1971_2014.dta" -> "S/build_geocoded_years.do";
        "S/overstatement_figure.do" -> "F/overstatement.pdf";
        "D/matchrates.dta" -> "S/matchrates.do";
        "S/matchrates.do" -> {"F/matchrates.pdf";
                "F/kdensities.pdf"}
        "E/eeoc_addresses_census_school.csv" -> "S/matchrates.do";
}Don't get hung up in the code. Consider the output:
Red boxes are canonical data, things that are not generated by any of my code. Black boxes are scripts. Blue boxes are generated data, while green boxes are figures. The capital letters stand for /scripts, /data, /figures, and /elsewhere. (The latter would normally be Canonical, but in this case the original data lives in a protected directory separate from the project itself.) 
This mostly follows that workflow I mentioned, but there are weird exceptions. The script build_geocoded_years.do writes a couple of data files that aren't used by anything downstream; so does spatial_v_aspatial_theils.py. There are two figures, overstate.pdf and overstatement.pdf, that look suspiciously similar. At first I thought I'd just typed something wrong, but no, they're both in /figures. The biggest problem though is in the upper right. The file matchrates.dta lives in /data, yet there is nothing that generates it! It's an orphan, and yet a script then calls it to create two figures that are in the paper. This is a problem. Without that file, or at least a record of where it came from, you cannot replicate these results from the original data sources.
Orphan and widow files like these jump out when you code up a directory this way. So to do those bugbears of reproducible research, the circular dependencies. Now I can clean them up:
- The extra data files written by build_geocoded_years.doare intermediate files in route togeocoded_YEAR_short.csv. (There are actually 41 such files; the YEAR in caps is my way of noting a variable in a loop.) Intermediate files are especially common in Stata, where before version 16 you could only hold one dataset in memory at a time. It's just important to include the relevanterasecommands at the end of that script to get rid of them.
- The spatial_figs.csvandoverstatement_figure.dothat come out ofspatial_v_aspatial_theils.pyare there because of one another. When I was first trying to plot the overstatement of aspatial (vs. spatial) employment segregation, and how to fit a locally weighted regression line through the points, I wrote a quick-and-dirty Stata script to do it. That script needed the data, so I had the python script write a CSV file that Stata could read. Later, I figured out how to plot this graphic directly in python (that's why the script callspyloess.py). Thus all of these excess files can be removed, and the script that produced them can be cleaned up.
- Some experimentation revealed that the data in matchrates.dtawas probably produced at one stage inbuild_geocoded_years.do. I have no memory whatsoever of doing this, but there was a fragment of commented-out code in the latter file, so there you go. I altered that code to reliably generate the needed data, and moved the figure-plotting code frommatchrates.dointobuild_geocoded_years.doitself. That eliminates the need for the intermediate data file and the subsequent script.
The new dependency structure appears above. The only deviation from that idealized workflow is where one python script calls others. But that's OK, since the scripts called just define functions used in the calling one.
These aren't gigantic changes, but that's kind of the point. If you stop and do this every now and then, before things get too convoluted, it isn't a big deal to tidy things up. If instead you wait until a deep clean is necessary--if you wait until you have to Marie Kondo your project directory--this can become a true nightmare. Lest I seem pretentious, let me end with a similar file map that I created while trying to clean up an earlier project, before stopping in despair:
Delay refactoring at your own risk!
Fun with Modulo and Boole
This was originally written on 19 January 2020.
Segregation research has what is called the "Checkerboard problem." Back in 1983, Michael White pointed out that most segregation measures look at how evenly distributed groups are between sub-units in a larger unit, like neighborhoods in a city, but ignore the spatial arrangement of groups within and between those units. The problem's name comes from imagining a city where each census tract looks like a checkerboard. Assume the city is half white and half black. A census tract that has houses arranged in the classic checkerboard pattern would be half white and half black, like the city as a whole; but so too would a tract where all the squares on the left-hand side of the board are white and all the squares on the right-hand side are black. The "problem" here is that traditional segregation measures would say that both of these tracts are unsegregated, but the spatial arrangement tells us otherwise:
One way around this would be to look at segregation between sub-tracts, but this has practical limitations. Data are only recorded to a certain degree of resolution most of the time. Instead, it makes sense to develop spatially weighted measures of segregation, as Reardon and O'Sullivan did in a 2004 paper. In that paper, they presented a nice graphic that shows the different dimensions of spatial segregation, contrasing Evenness/Clustering versus Isolation/Exposure:
I'm working on a paper about spatial segregation, and I wanted to reproduce this diagram. The most obvious way to do that for most people would be to fire up PowerPoint, get to copying and pasting, and eventually save the result as a graphic you could paste into Word or even a LaTeX document. Sure, you could do that, but the previous posts on this blog strongly imply that I'm not most people. Implementing this programmatically, as a vector graphic, seemed like a fun challenge.
This meant it was time for TikZ. For the uninfected, TikZ is a way of building vector graphics that plays extremely well with LaTeX. Its name is a recursive German acronym for "TikZ ist kein Zeichenprogramm," or "TikZ is not a drawing program." This passes for humor among some nerds. I've used TikZ on and off for a decade. When everything works, it can produce some truly lovely and clean data-driven graphics. But it is always a pain in the ass. Why do I keep coming back to it? Because each time, I learn a little more, and I'm addicted to that feeling.
The checkerboards in the first figure, for example, I drew in TikZ. And the code to do so is genuinely not that complicated:
\documentclass{standalone} %
    \usepackage{tikz}          % Vanilla headers from LaTeX
    \begin{document}           %
    \begin{tikzpicture}[scale=0.5]
    \foreach \x in {0,...,7}         % Loops! We all know loops!
      \foreach \y in {0,...,7} {     % Iterate to draw a checkerboard.
        \pgfmathsetmacro{\color}{iseven(\x+\y) ? "black" : "white"} % ifthenelse
        \draw[fill=\color] (\x,\y) circle(0.4cm); % Conditional fill
      }
    \begin{scope}[xshift=9cm] % Move all reference right of what we've drawn
    \foreach \x in {0,...,7}                     % More loops!
      \foreach \y in {0,...,7} {                 %
        \pgfmathsetmacro{\color}{\x>3 ? "black" : "white"} % ifthenelse again
        \draw[fill=\color] (\x,\y) circle(0.4cm); % Conditional fill
      }
    \end{scope}
    \end{tikzpicture}
    \end{document}For the most part, that code is going through loops to draw circles: the inner \foreach \y loop builds a column of circles, and the outer \foreach \x loop builds a row of these columns. Probably the most offputting lines in there are the ones that start with \pgfsetmacro. And they should be offputting! Those lines are your first indication that the whole TikZ language sits on top of another language: PGF, or the Portable Graphics Format. Confused yet?
Here's the thing about PGF, in my opinion: it's hot garbage. Or rather, it's an astonishingly low-level language, given how high-level TikZ itself is. This means that, whenever you have to do something tricky in your TikZ drawing, you punch through the glossy front end and fall a long way, down to a programming language that borrows from something clever that Don Knuth did in the 1970s.
(Before I go on, let me clarify how the if-then-else operator works in PGF. a ? b : c means "If a is true then b, else c." Therefore here \pgfsetmacro is setting the value of a variable (or "macro"--Stata users brrrrrrup brrrrrrup!) called color based on the if-then-else condition in the following set of braces. In the first case, I'm adding the x and y indices of each circle, starting at the lower-left of the board, and coloring the circle black if the index is even. In the second, I'm coloring all circles in the column black if the column index is 4 or greater.)
Clock math
The upper panels of Reardon and O'Sullivan's figure aren't that hard. They're basically repeating patterns in one dimension. More or less every third circle in the upper-right is black, for example. It gets a little janky as you move between rows, but we can work with that.
As soon as you hear "regular repeating pattern" you should think loops, right? NO. I've seen people in my field do that before, but--especially when you already have numbers you're looping over--you should be using the modulus function. This is not the place for an introduction to finite math, but if you want to be a quantitative social scientist and haven't spent much time with the modulus function, you have a homework assignment.
Modular arithmetic is just math done on a clock face. Everyone understands that three hours after 10 o'clock is 1 o'clock. That means that everyone understands that \( 10 + 3\, (mod\, 12) = 1 \) or that mod(10+3,12)=1. Modular arithmetic shows up all over the place, especially in cryptography (Diffie-Helman key exchange, anyone), but it's also a bog-standard tool for generating repeating sequences. Thus the code for generating the upper-right checkerboard isn't that much more complicated than our most basic checkerboard:
\foreach \x in {0,...,9} 
        \foreach \y in {0,...,11} {
          \pgfmathsetmacro{\shift}{mod(\y,2) ? .5 : 0}
          \pgfmathsetmacro{\bump}{mod(\y,2) ? 0 : 1}
          \pgfmathsetmacro{\color}{mod(\x+\bump,3) ? "white" : "black"}
          \draw[fill=\color] (\x+\shift,\y) circle (0.4cm);
      }There are only a few changes. I set a macro shift that scooches the odd-numbered rows over half the distance of a circle, to resemble the original. (Remember that all the row and column indices start from zero, and they start at the lower-left--I wasted a couple of hours on this code before I remember that I shouldn't be numbering and counting from the upper-left!) This means that shift affects the placement of the circles but not their color. Note that it isn't invoked until the actual \draw command. Meanwhile bump gives us an offset so that which exact circles are black alternates between rows. Alternating rows are actually only offset by one, but when combined with the shifting of the rows it looks like more. 
Pay close attention to how the mod() operator works here. Row indices are being mapped onto a two-digit clockface, so \( [0,1,2,3,...,8,9,10,11] \rightarrow{} [0,1,0,1,...,0,1,0,1] \). This is how mod(\y,2) lets me flag alternating rows. Then within the rows, the column indices are being mapped onto a three-digit clockface. On the even-numbered rows, \( [0,1,2,3,4,5,6,7,8,9] \rightarrow{} [0,1,2,0,1,2,0,1,2,0] \). On the odd-numbered rows, because of the bump, \( [1,2,3,4,5,6,7,8,9,10] \rightarrow{} [1,2,0,1,2,0,1,2,0,1] \). If you look closely, you'll see that the zeros in each of those right-hand sequences corresponds to a black circle. This is what that last if-then-else statement in the code does for us: any non-zero result receives the color "white" while zeros receive the color "black."
The upper-left panel of the figure works the same way, with different values for \bump and for the base of the modulus (Bonus points if you can figure them out). I had these two figured out within an hour of a cold start. The bottom two panels are more complicated, though. notice that they have multiple patterns, depending on the row index. And that lovely if-then-else operator, a ? b : c, does not nest.
If I were going to make those lower panels, I was going to have to spend more quality time with PGF.
This is where the whining starts
Let me note that the idea for this post was born while I was wrestling fruitlessly with row and column indices within modular arithmetic earlier today. I kept getting bizarre results that were making me doubt my understanding of PGF, and possibly of my sanity. Sometimes the resulting pictures only made sense if you started enumerating indices from 1 rather than from 0; sometimes it looked like the if-then-else operator returned the first value for zero results and the second for everything else. Because the modular base is small (i.e., two), it was easy to see the results as flipped rather than offset. I pulled on my hair a lot this afternoon, and my confusion was not helped by the syntax of PGF.
Consider the lower-right panel. Here there are three different line patterns. The first row is all white, the second has every third circle colored black, and the third row has every third circle offset by one colored white. The classic way to deal with three cases is a nested if-then-else: a ? b : c ? d : e. Alternatively you could write that as a ? c ? d : e : c, but it doesn't matter because neither works. Instead, you need to use a ver low-level PGF command, \ifnum.
This routine has some quirks. For starters, it only accepts integer arguments, which means that the floating-point resulds of, say, the mod() function will produce errors. Hence int(mod()) will be necessary. Nor can you pass functions as arguments to \ifnum which can then be evaluated to produce variables. Instead the functions have to be evaluated outside the loop and passed to it as \pgfmathresult. As you might guess, each of these quirks took some time to sort out...
Eventually I could produce the lower-right panel, including the offset of the alternating "sets" of triangles. This was eased by the fact that the logical tests that you have to pass to the if-then-else operator are not that complicated: either a circle's index modulo 3 is zero or it isn't. If it is, color it black in one case, white in the other.
But with that said, turn your eyes back to the lower-left panel of Reardon and O'Sullivan's diagram. Consider in particular the third row up from the bottom, the one with two black circles filled in. Here a circle should be black if its index modulo 5 is zero OR if its index modulo 6 is zero. Suddenly the logic becomes more complicated!
It will not work to test for whether one condition or the other is satisfied here. But to understand why, we need to spend a little time with Boolean algebra.
Once again, here I describe something that people with a background in mathematics or computer science will find very basic. But I stress that, in most of the social sciences, this sort of thing is maybe mentioned in an early class on probability theory and then never disinterred. Furthermore, the way these things work in the context of a programming language are never discussed.
Take the humble or() operator. In Boolean algebra, \( or(a,b) \) evaluates as True if either \( a \) or \( b \) evaluates as true. Yet in many programming languages, including PGF, or(a,b) evaluates as True if either a or b returns a non-zero value. This is a problem for the modulus procedure, because most combinations will return a non-zero value. For example, or(mod(\x+2,6),mod(\x+1,6)) will return a True value on every index, because either one of the moduli is non-zero or both are. I need to fill in circles with black only when one of those conditions is non-zero and the other one is not. This is the domain of xor(), the exclusive-or operator. Which has all sorts of uses, especially in binary arithmetic--but PGF does not have an xor() operator!
That said, the point of Boolean algebra is to be able to build more complicated logical expressions from simpler ones. If you play around for a while, you'll realize that \( XOR(a,b) \equiv AND(OR(a,b),NOT(AND(a,b))) \). I used that to generate that row. But then I had to handle the row above it, where there are three tests to run, and where I want to shade a circle if only one of the three moduli returns zero. This sort of test can be implemented as nested exclusive-ors: xor(xor(a,b),c). But what does that look like if you do not have a native exclusive-or operator? Behold:
\( AND(OR(AND(OR(a,b),NOT(AND(a,b))),c),NOT(AND(AND(OR(a,b),NOT(AND(a,b))),c))) \)
I will admit that I patted myself on the back when I typed that in and it worked the first time! Thus the code for the lower-left panel:
\foreach \x in {0,...,9}
        \foreach \y in {0,...,11} {
          \pgfmathsetmacro{\shift}{mod(\y,2) ? .5 : 0}
          \pgfmathsetmacro{\color}{"white"}
          \pgfmathsetmacro{\testx}{\x+3*int(\y/6)}
          \pgfmathparse{int(mod(\y-1,6))} % Tip of triangle
          \ifnum\pgfmathresult=0 
            \pgfmathsetmacro{\color}{mod(\testx+2,6) ? "white" : "black"}
          \else
            \pgfmathparse{int(mod(\y-2,6))} % 2nd of triangle
            \ifnum\pgfmathresult=0
              \pgfmathsetmacro{\color}{
                and(or(mod(\testx+2,6),mod(\testx+1,6)),not(
                and(mod(\testx+2,6),mod(\testx+1,6)))) ? "black" : "white"}
            \else
              \pgfmathparse{int(mod(\y-3,6))} % 3rd of triangle
              \ifnum\pgfmathresult=0
                \pgfmathsetmacro{\color}{
                  and(or(and(or(mod(\testx+3,6),mod(\testx+2,6)),not(and(
                  mod(\testx+3,6),mod(\testx+2,6)))),mod(\testx+1,6)),not(and(
                  and(or(mod(\testx+3,6),mod(\testx+2,6)),not(and(
                  mod(\testx+3,6),mod(\testx+2,6)))),mod(\testx+1,6)))
                  ) ? "white" : "black"}
              \else
                \pgfmathparse{int(mod(\y-4,6))} % top of triangle
                \ifnum\pgfmathresult=0
                  \pgfmathsetmacro{\color}{
                    and(or(mod(\testx-1,6),mod(\testx-2,6)),not(
                    and(mod(\testx-1,6),mod(\testx-2,6)))) ? "white" : "black"}
                \fi
              \fi
            \fi
          \fi
          \draw[fill=\color] (\x+\shift,\y) circle (0.4cm);
      }Hopefully we can all agree that the readability of the code suffers somewhat in this case. Yet at that point I could reproduce the entire figure in code:
Am I just an idiot?
Literally as I wrote this blog post, complaining about the lack of an XOR() operator in PGF, I realized that you could do what I needed without calling an XOR(). You see, you can specifically test whether the modulus evaluates to zero. Thus instead of specifying
and(or(mod(\testx+2,6),mod(\testx+1,6)),not(
    and(mod(\testx+2,6),mod(\testx+1,6)))) ? "black" : "white"}I could instead write
or(mod(\testx+2,6)==0,mod(\testx+1,6)==0) ? "black" : "white"}Similar shortening could be done for the three-way test. This thus raises the question of whether I'm just an idiot.
But let's take a step back. Am I not already an idiot for spending literally a day hacking away at this, when I could have produced it in an hour or so in something like PowerPoint?
I wrestle with this question all of the time. I like going down rabbit holes when I am working on things, and I like understanding how things work. I started trying to implement this figure in TikZ in the first place because I'd never really mastered the language's looping and conditional structures, and it seemed like a chance to learn those. I've definitely learned a ton! But I always have to balance that learning against the opportunity cost. Should I have gone skiing today instead?
I once discussed TikZ with some PhD students at Stanford while Mike Hannan was in the room. Mike told the students in no uncertain terms that, cool as this stuff was, while you are writing the paper you should make a sketch on a piece of paper and then set it aside. Only at the very end should you wrestle with figure generation, because it's such a time sink. I recognize the value in that injunction, and in general I try to follow it. But sometimes it's down in these rabbit holes that I learn things that speed up my work somewhere else. I use modular arithmetic all of the time, for example, not only in data analysis but even in laying out tables in Salesforce to help our staff award students financial aid.
Perhaps this has become a blog whose posts are triggered by my obsessions. But even then at least it may have the value of showing people how much work exists behind the work, and convincing you other obsessives that you are not alone!
A Christmas miracle: New working paper
This was originally written on 20 December 2019.
As mentioned in the last post (which, dear God, was more than two months ago!), I've been underwater for most of a term. Which is a thing that happens to professors sometimes, and I needn't apologize for letting blogging slip. Not least because today, less than 48 hours after I posted the grades for my last classes, I finished the first complete draft of a paper that we've been working on, in fits and starts, since February 2018.
"Industrial Change, the Boundary of the Firm, and Racial Employment Segregation," with Rembrand Koning
We showed in an earlier paper that racial employent segregation between American workplaces rose over the last generation. A lot of people asked us if that was an industry story. Had the decline of manufacturing and the rise of services caused this, for example?
Short answer: not really. In particular, you can't get much of a handle on patterns of racial employment segregation if you do your thinking at the level of the sector or the industry. You have to focus on the structure of the firm, specifically which jobs are done by the firm itself and which jobs it externally contracts for. Given that we know there's occupational stratification by race in America, and given that the last generation has seen a wave of companies' choosing to contract for many business services that would have once been performed by workers on their books, it should come as no surprise that segregation between but not within detailed industries has been on the rise, across most economic sectors.
This project took a lot of back-office work; I wrote about part of it on this blog earlier this year. And this is very much a version-1.0 draft. But ça existe! I can link to it! And with luck, within a few months we'll be able to get it under peer review.
...and that, my friends, is the true meaning of Christmas.
Why I love Emacs sometimes
This was originally written on 8 October 2019.
I've not updated this blog in more than two months. Have I lost interest? Not at all; life got in the way. Since the middle of August, I have done some stuff:
- Attended the Academy of Management, my field's annual conference, in Boston
- Wrote and submitted my promotion packet at McGill
- Prepped a new (to me) class, the core MBA course on Organizational Behavior
- Taught that class in a four-week, compressed format, one that had me in the classroom as many as 20 hours in some weeks
- Started teaching the part-time version of that MBA course in the evenings
- Taught the first half of my undergraduate core course in the same subject
- Wrote a large grant application to fund my ongoing research
- Slept, bathed, etc.
In other words, Internet, forgive me, I've been busy. At last the crush has begun to ease, and I begin to see slivers of unstructured time reappearing in my calendar. To the blogs!
Actual content
When I moved from Stanford to McGill, I was startled by how differently the two schools handle their exams. Stanford has a strong honor code; McGill has a strong police state. In Quebec I first heard the (admittedly awesome) term "invigilators," the personnel who perch at the front of an examination room and, in extremis, even accompany students to the bathroom during tests.
I mustn't let my dislike for this system hijack this post, so I'll zoom in on one aspect: preparing multiple-choice exams. For classes above a certain size, professors must submit multiple versions of any exams that have a multiple-choice component. The rationale is that, while students are supposed to sit with an empty chair between them, sometimes (Horrors!) they have to sit side by side. In such instances, they should have exams that present the questions in a different order, to complicate copying.
Obviously this system incentivizes professors to use simple multiple-choice exams wherever possible, because they're easy to score, easy to recycle, and can be protected from catastrophic cheating by the police-state procedures. Which might feel like pedagogical laziness, but you can always resolve that cognitive dissonance by choosing to belive that your students are devoted to cheating wherever possible...
...Dang it, that's a hijacking. Let me start again.
Presume you have a multiple-choice component on your exam. You've written a document that has each question along with five possible answers:
1. This is a question
a) This is an answer
b) This is another
c) You get the idea
d) Hoo boy
e) It gets tedious
2. This is another question
a) In case you weren't sure
b) Presume there's five answers
3. This is a third question
a) There actually aren't answers
b) Learning that is part of growing upYou must now create multiple versions of this document, each with the questions in a different order, but each sequentially numbered.
This sort of situation, where paranoia mates with technophobia to breed stupidity, drives me up the wall. I once had someone advise me to create a second version by moving the first question to the end and then renumbering everything. Can we agree that that would be ceremonial compliance, which neither solves the problem nor questions the system that poses the problem in the first place? Alternatively, some faculty recommended an application that shuffles such questions and outputs the resulting text files. This special-purpose application of course only exists for Windows machines...and wants you to pay actual money for it.
I haven't figured out how to smash the system yet, but I have figured out a solution to the problem. Inevitably, it uses Emacs. I write it up here in part to remember how I did it, but also because it might be worth sharing.
To review some conventions about representing keystrokes in Emacs:
- C-xmeans "hold down- CTRLand press- x"
- M-xmeans "hold down- ALTor- OPTIONand press- x"
- SPCis the spacebar
- RETis Enter or Return
I'll presume that you know what regular expressions are, even if
you're not super familiar with their syntax. Note that you represent a
carriage return in a regular expression with C-qC-j; that is, you
hold CTRL and press q then j. Note also that, because Emacs LISP
uses a ton of parentheses, in Emacs you have to escape parentheses
used in regular expressions.
THUS: Write the exam as you normally would, numbering each question as "xx":
xx. This is a question
a) This is an answer
b) This is another
c) You get the idea
d) Hoo boy
e) It gets tedious
xx. This is another question
a) In case you weren't sure
b) Presume there's five answers
xx. This is a third question
a) There actually aren't answers
b) Learning that is part of growing upFirst, turn each question into a single line of text. This is done
with Emacs's replace-regexp command, which is linked to C-M-%:
C-M-% C-qC-j\([abcde]\))C-qC-j? RET ␣\1) RET
xx. This is a question a) This is an answer b) This is another c) You get the idea d) Hoo boy e) It gets tedious
xx. This is another question a) In case you weren't sure b) Presume there's five answers
xx. This is a third question a) There actually aren't answers b) Learning that is part of growing upSecond, shuffle the questions, i.e., shuffle the lines. Doing so links up Emacs and shell commands in a way that makes nerds all shivery:
M-< C-SPC M-> C-u M-| gshuf RET
xx. This is a third question a) There actually aren't answers b) Learning that is part of growing up
xx. This is a question a) This is an answer b) This is another c) You get the idea d) Hoo boy e) It gets tedious
xx. This is another question a) In case you weren't sure b) Presume there's five answersWhat's going on? First we move the cursor to the start of the
file. C-SPC sets a mark, i.e., starts highlighting. M-> moves the cursor to the end of the file, thus highlighting the whole
thing. M-| will pipe the highlighted region as standard input to a shell command of your choice. Normally the standard output of that command appears in a new window, but prefixing with C-u causes the standard output to overwrite the highlighted region instead.
The shell command is gshuf, the Mac OS implementation of the Unix
command shuf. It's available through the coreutils package. It
reads from standard input, permutes the line order, and writes the
result to standard output. The type of command that, once you learn
about it, you wonder how you lived without it.
Note that you can save after you run this, then run and save again. Now you have two versions of your file with the lines in a different, random order.
Third, renumber the file. A superb feature of Emacs's
regular-expression implementation is that including \# in the
replacement string will cause the command to replace the matched
pattern with an incremental count. This count starts at 0 by default,
but it's simple enough to start it at 1:
C-M-% ^xx RET \,(1+ \#) RET
1. This is a third question a) There actually aren't answers b) Learning that is part of growing up
2. This is a question a) This is an answer b) This is another c) You get the idea d) Hoo boy e) It gets tedious
3. This is another question a) In case you weren't sure b) Presume there's five answersFourth, turn the lines back into question blocks with another
replace-regexp:
C-M-% ␣\([abcde]\)) RET C-qC-j\1 RET
1. This is a third question 
a) There actually aren't answers 
b) Learning that is part of growing up
2. This is a question 
a) This is an answer 
b) This is another 
c) You get the idea 
d) Hoo boy 
e) It gets tedious
3. This is another question 
a) In case you weren't sure 
b) Presume there's five answersAdding a blank line between blocks is left as an exercise for the reader.
Why I learned Cyrillic
It took me 20 or 30 times as long to write this up as it did for me to _do_ it. On the other hand, I first opened Emacs in 2005, so you could say it took me nearly 15 years...?
I am not an "Emacs guru." I don't program in Elisp, I don't check my mail or Surf the 'Net in Emacs, and I don't care if you prefer Vi or Sublime or whatever. The thing that all of these pieces of software have in common, though, is a profoundly deep feature set, one that will reward continued engagement. Every few months I learn something new I can do in Emacs--not in the sense of a new mode (I want a niche T-shirt that reads SHUT UP ABOUT ORG MODE) but in the sense of a better way to edit text. Editing text is a huge part of what academics do for a living, and I think incremental investment is worth it.
People tell me they don't have the time to learn things like this. I can only say it so many ways, but: I didn't have the time either. That's why I've learned slowly. But each part of the effort was worth it. And certainly if you're in graduate school, and you have your career ahead of you, for god's sake spend some time engaging with software like this. If only so that, in a few years' time, I don't come across someone else trying to randomize their exam questions with an unholy combination of Wordpad and an Excel spreadsheet!
The trials of self replication
This was originally written on 17 July 2019.
I'm trying to calculate a spatially-weighted segregation statistic, one for which there's no canned routine. After fiddling with it in Python, I could output numbers rather than error messages. But were they good numbers? I paused to ensure that I could correctly implement the aspatial version of the statistic. I knew what its values should be, from a previous paper, so I could check my work against it.
Focus on the previous result for a moment. Consider this figure:
(From JP Ferguson & Rem Koning, “Firm Turnover and the Return of Racial Establishment Segregation,” American Sociological Review)
This is our estimate of racial employment segregation between US counties (in white), and racial employment segregation between workplaces within counties (in black). You want to keep the white separate from the black, and not because this is a study of segregation (hey-O); you don't want to confuse people's living in different labor markets with their working in different places within the same labor market. The former can still reflect segregation (see: spatial mismatch, sunset towns, Grand Apartheid, etc.), but of a qualitatively different type.
The decline and rise of the within component was the "A-ha" finding of that paper, and it's this that I set out to replicate with my shiny new code. The first attempt was not encouraging, though:
Hopefully the differences between the two graphs are obvious. I rolled my eyes, sighed, realized no one could observe me and that I was being melodramatic, and dug in. Per this blog's ostensible focus on process, I'll present my investigations in order rather than jumping straight to the answer.
Whence the differences? These new data are a strict subset of the old. Specifically, I calculated the new numbers on establishments that I successfully geocoded. Is there systematic bias in what was geocoded?
If you squint at the second figure, you'll see two weird spikes. The between component jumps up between 1985 and 1986, and the within component jumps up between 2006 and 2007. I've seen the jump in the within component before and know how to solve it; for now, focus on the between.
The government recorded three-digit ZIP codes in these data through 1985, switching to five-digit codes in 1986. That probably affected the match rate. I tabulated the accuracy rate for the two years before and after the change:
| year | total | accurate | accrate | 
| 1984 | 141137 | 108939 | .771867 | 
| 1985 | 140678 | 108642 | .772274 | 
| 1986 | 146455 | 94903 | .648001 | 
| 1987 | 148883 | 96409 | .647549 | 
Weird. I thought the accuracy rate would have gone up? The match rate definitely did:
| year | match=0 | match=1 | 
| 1984 | 3,529 2.50 | 137,608 97.50 | 
| 1985 | 3,469 2.47 | 137,209 97.53 | 
| 1986 | 199 0.14 | 146,256 99.86 | 
| 1987 | 187 0.13 | 148,696 99.87 | 
| Total | 7,384 1.28 | 569,769 98.72 | 
But wait, how can the accuracy rate go down if the match rate goes up? I summarized the accuracy score reported back from the geocoding algorithm, by year:
| year | mean_acc | 
| 1984 | .888 | 
| 1985 | .888 | 
| 1986 | .904 | 
| 1987 | .904 | 
Wait--this goes up, too? What the hell had I computed as the accuracy rate, then?
Going back through my code, I realized that I had calculated the "accuracy rate" as the proportion of geocoded observations where the reported accuracy was 1. And I'd kept only these. Yet those tables suggest that, while the average accuracy had gone up, the share of perfect accuracy had gone down. This is why the Good Lord gave us kernel densities:
With the three-digit ZIP codes before 1986, matching was basically an all-or-nothing proposition. From 1986, with the longer ZIP codes, more cases were matched, but more of the matches were good--i.e., 90%--but not perfect. By keeping just the observations with accuracy 1, I'd inadvertently thrown out a non-trivial number of the matched observations from the later years. I changed the code to keep if accuracy was .9 or greater starting in 1986.
The next step, just for sanity's sake, was to correct the post-2007 bump in the within component:
Better, though the level for the very early years still seems way too high. That might reflect which observations were being successfully geocoded in those early years. I know that the match rate is lowest in the early 1970s, so that seems likely.
I now wanted to see how the improved match rate after 1986 might have introduced bias. A crude way to do this is to look at how the distribution of missing, non-geocoded observations changed across different states. I calculated the share of non-geocoded observations by state, in 1985 and 1986; calculated the change in states' shares; squared that (to make outliers more obvious) and re-applied the sign of the change; and sorted the states by their signed squared change. That produced this graph:
This is telling. Before the change, California's establishments comprised a much higher share of the non-geocoded observations (hence the large negative change), followed by Florida's. Meanwhile, Illinois's establishments were a much lower share of the non-geocoded observations. This does not mean that fewer establishments were matched in Illinois in 1986 than in 1985; remember, the match rate goes up. Instead, far more establishments were added in California and FLorida than in Illinois.
This matters becuase the demographics of these states are very different. California and Florida had far more latino and Asian residents in 1985/1986 than Illinois did. Furthermore, some of the other "bulked-up" states include Arizona, New Jersey, and New York, all of which were comparatively ethnically diverse, while the other "slimmed-down" states alongside Illinois include West Virginia, Kentucky, and Washington. The change in how the government recorded ZIP codes in the mid-1980s changes the population of establishments in the dataset, increasing the relative size and diversity of some states and thus increasing the between-area segregation measure.
Two points are worth taking away from this. First, this implies that the jump after 1986 isn't the thing that needs explaining; the "drop" before 1986 is. And the explanation is sample-selection bias. This is good, because the later data are the more complete. I'd like them to be the less-biased! But my instinct was to assume that the earlier years were right and that something went wrong. I think this is a common way of reasoning, but it's also because the between component lies below the within component in the earlier years and in the first figure, and I was asssuming that was right. I'll come back to this.
Second, this isn't a "bump" that I can correct like the post-2007 one. The 2007 bump in was due to a one-time change that the government made to its data collection in 2007. That year, they made a special effort to survey firms that appeared in Dun & Bradstreet records and seemed like they should have filed an EEO-1 survey but had not. The survey responses that year included many new workplaces (the sample size jumps by nearly 2%), most relatively small, single-establishment firms with relatively high latino employment. That shift in the response population caused the 2006/2007 jump. But this particular effort didn't continue. The EEOC didn't continue to seek out such firms, and so over time, as the firms first surveyed in 2007 turned over, the respondent population converged back toward the pre-2007 type. That explains the sharp decline in the within component in the years after 2007.
To correct for this one-time shock, we removed the 2007 data from analyses, and we removed the firms that entered in 2007 from subsequent years' calculations. But we cannot do this with the 1985/1986 change. The shift from 3-digit to 5-digit ZIP codes was permanent, so the uptick in the between component just has to be treated like a permament shift. Fortunately we mostly care about trends, not levels!
So, I was almost there. But this is where the trials of self-replication really show up. You see, that first figure has an error in it! After we published the paper, Sean Reardon pointed out that we calculated the within component as a size-weighted sum of counties' between-establishment segregation measures, when instead we should have calculated it as the size-and diversity-weighted sum of same. We published a correction for this...
Tweeting about a recent publication corrigendum.https://t.co/j7MNkA12D6
— Rem Koning (@orgRem) May 15, 2019
I love seeing working papers and new pubs, but we all make mistakes.
In this case, after my paper w/ @jpferg came out in @asr_journal Sean Reardon pointed out that there was a mistake in our equations!
...and noted that, while the proper weighting did not change the broad shape of the within component, it did shift everything downward slightly. (This is because more diverse counties are somewhat less internally segregated. Giving those counties more weight lowers the within component, which is just a sum.) Voilà:
This is about the within component, though. The between component _is_ just a size-weighted sum, so it should not move when you make this correction. That implies that the between component has actually been larger than the within component almost this whole time! It's OK, we had no prior about this, and given that (e.g.) the between-industry component has always been bigger than the within-industry component of this measure, it isn't mind-blowing. But it does imply that when I rebuilt everything and got this:
...this is actually a faithful replication of the first figure, even though the two look different!
So what did I learn?
- Your goal shouldn't be simply to reproduce the published figures, even if they're your own. If there's a problem or error, with the published results, you need to replicate a fixed version.
- But you can't just call your slightly-different version fixed and move on. You also want to explain why there's a difference in the first place. In such instances, showing that you can replicate the published results by introducing the mistake into your code is a good way to convince yourself that you're on the right track.
- Replication is hard. Remember, these are my own data and code!
- I should have started with this, rather than going straight to the spatially weighted calculations (of which more later, probably). Now, if I modify the calculation and get different results, I have a lot more confidence that this reflects real differences, and not some problem with my replicaiton strategy.
Social science and the Big O
This was originally written on 18 June 2019.
...No, not that Big O. Behave yourselves.
One of my goals with this blog (insofar as a blog has goals) is to write about the work behind social-science research, to discuss the process as much as the product of this thing I do. It is useful to see the process, especially if you are newer to the field; but utility aside there is so much that goes into a research project that doesn't show up in the final product. One of the first, hard lessons of graduate school is that most of the work involved in producing a study is of interest to no one save the producer. Indeed you can often tell a paper is written by a graduate student because it lavishes space on the details of the matching algorithm or some other "input" step. You can hear the youngling saying, surely this sort of thing isn't meant to disappear forever, is it?
It isn't meant to be in the final paper, but that doesn't mean it's meant to disappear forever. Sometimes it would be useful to discuss these things, lest hard-won discoveries never spread. This post for example is about the efficiency of computation and things I learned relating to that, things that will only appear in the final paper insofar as they make that paper possible. But the blessing and curse of good research is that often you have to learn a little bit about a lot in order to do the job, because so much of our work is basically bespoke.
With that, a preface: I am not a computer scientist. Take the following as a clever layman's understanding of these issues, and expect me to get some details wrong.
Big-O notation is a thing in math and computer science that social scientists almost never encounter. You can read the Wikipedia entry for the details, but a common use of it is to describe how the compute time of an algorithm grows as the input data scales. As such it hooks into P vs. NP and all sorts of interesting ideas, but it often shows up in the workaday question, "Why the hell is my code running so slow?!"
There are many ways to implement most functions, and those ways can vary enormously in their efficiency. Fifteen years ago, R was notorious for the inefficiency of many of its commands. For lists above a certain size, it was faster to set up a counter in a for-loop and iterate over the elements in the list, incrementing the counter each time, than it was to run length(list)! In the worst case, a length function should run in \( \mathcal{O}(n) \), that is, compute time should be proportional to the size of the object. (Let's call that "linear time.") It takes a certain creativity to make that function take more than linear time. As a longtime Stata user, I cherished this tale about R, so much so that I've never bothered to check whether it is in fact apocryphal ;)
I have been reading up on big-O notation because I've been working on a project that involves spatial data. Bracket the research question. I have \( (x,y) \) data for about 100,000 establishments all over the United States, each year, for 40 years. For each establishment, I have to calculate a statistic that is based on the properties of nearby establishments, where "nearby" is defined by a radius that I choose. The statistic itself isn't complicated. Nor is it complicated, conceptually, to figure out which establishments are near any focal establishment. But computationally, this can be really inefficient.
In the naïve case, for each point, you must calculate the distance to every other point, flagging those within the radius. If you can't keep track of which comparisons have already been made (i.e., if you have to calculate the distance from A to B and from B to A), then this involves \( n \times (n-1) \), or \( n^2 - n \), comparisons. Even if you can keep track, that's still \( (n^2 - n) / 2 \) comparisons...and as \( n \) grows, that quadratic term swamps the \( 1 / 2 \) constant. Thus this naïve algorithm runs in \( \mathcal{O}(n^2) \), or quadratic time. Compute time here scales with the square of input size.
This is muy malo. From what I've seen on Stack Overflow posts, I'm working with data that's about 10 times larger than most GIS projects. A routine that runs in \( \mathcal{O}(n^2) \) will thus take about 100 times longer on my data. For the record, that has worked out to more than a day for each year's data.
But here's the thing: "nearby" in my case is about a quarter of a mile! I know that virtually all other establishments in the data are too far away to matter. What I needed was a fast way to zoom in on feasible candidates before further testing.
Adventures in set theory
My first thought was that, given the tightness of my radius, I'd just need to look at others within the same county--I have county recorded in the data. At worst, there might be establishments near the edges of their counties, in which case I'd have to check adjacent counties, too.
It's pretty easy to create an adjacency list for U.S. counties (really!). In python, make a dictionary, where they keys are the county FIPS codes, and the values are lists with the adjacent counties' FIPS codes (include the county itself):
county_adjacent = {
    1001 : [1001,1021,1047,1051,1085,1101],
    1003 : [1003,1025,1053,1097,1099,1129,12033],
    ...
    78020 : [78020,78030],
    78030 : [78020,78030],
}Then you can write something like this:
for county in county_adjacent[county]:
    for establishment in county:
        ** test for distance and do stuff **This greatly reduces the search space but is still pretty inefficient. Consider an establishment in Manhattan. There are six adjacent counties (Bergen and Hudson in New Jersey, plus the outer boroughs), and all of them are just lousy with large establishments. At the very least, this process will bog down in large cities.
Next, it occurred to me to first try a spatial test. If I care about establishments within a certain radius, that implies a circle around the focal establishment (that's why I was building circles in the previous post). I could take that circle, as a geometric object, and see if it intersected with the various counties' geometries. Then I'd only have to check establishments in the counties that the circle actually overlapped--in most cases, just one!
for county in county_adjacent[county]:
    if circle.intersects(counties[counties.county == county].geometry.iloc[0]):
        county_ests = ests[ests.county == county]
        in_area = county_ests[county_ests.within(circle)]
        sub_dyads = pandas.concat([sub_dyads, in_area])Two things to notice here. First, this code uses the geopandas package, so the dataframe-query structure follows that of pandas, while functions like intersects() and within() are ultimately being drawn from shapely. Second, this code involves three geodataframes: one with the establishments, one with the "circles" around each establishment, and one with the polygon geometries of all the counties. We'll come back to this use of dataframes and the like.
A surprise appearance of \( \mathcal{O}(n^2) \)
So far, so good. Code of this type took forever to run, though. It would start fine, managing some 3,000 establishments per minute, but it would gradually slow down. Indeed this took longer to complete than the original, naïve version--nearly 30 hours for 100,000 establishments! This isn't how programming is supposed to work, right? What was I doing wrong?
The problem lay in the pandas.concat() function. This takes two dataframes as inputs and returns their concatenation as an output. It might seem that this would just write the elements to the existing dataframe (in which case it would run in \( \mathcal{O}(n) \)), but no: it writes both elements anew. Think about what this means at every step:
- Step 1: write \( n_1 \) rows
- Step 2: write \( n_1 + n_2 \) rows
- Step 3: write \( n_1 + n_2 + n_3 \) rows
- Step 4: write \( n_1 + n_2 + n_3 + n_4 \) rows
- Step N: write \( n_1 + n_2 + ... + n_N \) rows
This amounts to writing data \( (n^2 - n) / 2 \) times, which means we're back to \( \mathcal{O}(n^2) \). No matter how many operations I was saving by my clever spatial sub-setting, I was more than gaining them back with this incredibly inefficient write process! (Before you ask, pandas.append() has the same behavior.)
The solution to this problem, at least, was to create results as a list of lists rather than as a dataframe, and then use list.append(), which runs in \( \mathcal{O}(n) \). This removed the dramatic slowdowns, but by itself it could not surpass the fastest initial performance of this type of code, which as mentioned was about 3,000 establishments per minute.
Detouring through search algorithms
Obviously, if I could figure out some faster way to subset the feasible candidates, I could go faster still. But how? By itself, Euclidean distance is pretty fast to calculate. And indeed that's what the within() function was doing. I began Googling phrases like "efficient algorithm to find points within distance" and the like. At some point, this led me to a page comparing linear to binary search.
Say you have a list. You want to find whether an item is in the list--or find its index in the list, which amounts to the same thing. If the list isn't sorted then the best approach is just to start at the first element and compare it to your item. If it matches, you're done! If not, take the next element, and repeat. On average, you'd expect to have to check \( n/2 \) elements, where \( n \) is the length of the list. This means that the compute time for a linear search is proportional to the length of the list--classic \( \mathcal{O}(n) \) stuff.
Now imagine that the list is sorted. Then the best approach is to start, not at the list's start, but in its middle. If the middle element matches your item, you're done! If not, then ask whether it's bigger or smaller than your item. If it's bigger, then you can rule out the entire second half of the list. If it's smaller, then you can rule out the entire first half of the list. Take the middle element of the remaining half, and repeat. After each test, you halve the remaining search space, so after \( k \) tests there are \( n/2^k \) elements left. The most searches we can do before having one item left will therefore be when \( n/2^k = 1 \); solving for \( k \) gives \( k = \log_2 n \). This means that binary search runs in \( \mathcal{O}(\log{}(n)) \), which is a damned sight faster than linear time!
From \( \mathcal{O}(n^2) \) to \( \mathcal{O}(\log{}(n)) \)
This seemed exciting. What if I could sort my establishments by their locations...? But that seemed problematic. \( (x,y) \) coordinates are by their nature two-dimensional. How would you order them?
Ah, but here's the thing: sorting is itself something that can be implemented very efficiently. So why not just sort by x, and then sort by y? Granted, I didn't need to find each establishment in the sorted list, but instead all establishments within a certain x- or y-distance from them. In a sorted list, though, this just amounts to finding the indices of the establishments at \( x \pm r \), where \( r \) is the radius. Once you have those, you have the subset of establishments that are feasible candidates, at least in the x dimension. If you do the same thing with a list sorted on the y coordinate, then you can get the subset of establishments that are feasible candidates, at least in the y dimension. Then you can find the list intersection (another efficiently implemented algorithm), and you're really quite close.
Consider it graphically, using a bit of Boston in 1971 from the last post:
With this process, I could very quickly extract a strip of land half a mile wide with all x candidates and very quickly extract another half-mile strip with all y candidates. Then the list intersection would correspond to establishments within the square, half a mile on each side, centered on the focal establishment. All that I'd need to do then would be to run something like distance() on that comparatively tiny number of candidates to check that they were in the circle, not just in the "corners" of the square.
At this point, I started implementing a binary search function, one that could return a slice of the list around the position in question, rather than the position itself. This is extremely simple in comparison. From there you just need the intersection of those two lists of lists:
def intersectionLoL(list_of_listsA, list_of_listsB):
    ''' Returns the list of lists that is the intersection of two
    other lists of lists. The tuple/set/list nonsense is there because
    lists are not themselves hashable objects '''
    return list(list(i)
                for i in list(set(tuple(i) for i in list_of_listsA) &
                              set(tuple(i) for i in list_of_listsB)))From there, you can define a function that checks whether points in that list are within the circle:
def withinCircle(intersect_list):
    ''' Takes a list of lists with (x_coord, y_coord) for points and
    (x_center, y_center) for some reference point. Returns the
    sub-list where the distance between those is below a quarter
    mile. (Yes, I know there's a lot of hard-coded garbage in here.) '''
    new_list = []
    for record in intersect_list:
        if math.sqrt((record[x_coord] - record[x_center])**2 +
                     (record[y_coord] - record[y_center])**2) >
                     .004167:
            continue
        else:
            new_list.append(record)
    return new_listObviously, there's more code involved than that to read in the data, create the sorted lists, and so on. Yet this gives me a way to find the relevant establishments around every focal establishment, a way that runs blazing fast in comparison to everything else I've tried.
You know I'm going to crow about it: whereas some of these approaches took more than 30 hours per year of data, this takes less than 20 seconds.
Genuinely, some lessons are learned
It's a cliché in the quantitative social sciences that we don't roll our own code, whether in data management or in statistical estimation. Instead we rely on canned routines. Often you can date the diffusion of a new estimation technique, not from when it is discovered or demonstrated to be better than earlier approaches, but from when it appears as a standard command in a popular stats package. And in general, I think this is fine. Replication is easier when implementations are transparent. Ideally, canned routines are also efficient routines.
But this isn't always the case. A benefit of open-source programs and user-generated code, the ease with which new routines can be distributed, is also sometimes a curse. Inefficient initial implementations make it into codebases all the time.
I want to consider a more subtle problem, though. Frameworks like pandas were developed to make it easier for people used to a statistical-programming environment to wrassle with a general-purpose language like python. Pandas implements the dataframe structure, which feels very comfortable to those of us used to R, Stata, or the like. There's a cost, though. Pandas dataframes leverage NumPy's numerical arrays, which means you have interpreted code running interpreted code. And then geopandas is is itself leveraging pandas, as well as shapely, which draws on its own set of libraries...
In most cases, this is all worth it. Most of us will accept longer run time if it takes us less time to think and write the code we run. Implementing routines at a lower level of the "comprehensibility stack" rarely saves time. Usually, the only benefit is that you gain skills that you'll use later on elsewhere, and it's too easy to trot out that excuse whenever you want to procrastinate!
Sometimes though you want to bend a package to do something its authors didn't have in mind when they wrote it. In those moments, all of their conveniences become things you'll have to work around, which usually means inefficiency. And if the goal is speed or even just feasible run times, This Is A Problem.
None of this is news to computer scientists. But social scientists aren't taught to think in these terms. There are entire classes of models that traditionally weren't often used because their implementations took so long to run (multinomial logistic regression, anyone?). This wasn't too severe a problem--but as our data sets continue to grow in size, the underlying implementations in our workaday software increasingly pose constraints.
The irony is that, thirty or forty years ago, discussion of specific implementations was very common in our literature. When computers were far weaker, it mattered more to get the method of computation right. Why do you think we all got taught OLS--because it is realistic or flexible? No. It's relatively easy to implement by hand, and has an analytical solution. Many early papers in network analysis have appendices discussing the algorithms for calculating the network statistics. The spread of faster computers and standardized software packages has definitely been a net positive, but this type of thinking has fallen out of fashion over the last generation. (I make an exception, of course, for anyone obsessed with Bayesian statistics.) Yet as Moore's Law slows down while data continues to increase, I suspect we'll all have to get back in the habit.
Will it work?
Basically, dealing with terror.
This was originally written on 8 May 2019.
Every research project has one or two assumptions that, if violated, torpedo it. These assumptions are the scariest things to test, and they can keep me from working. They often turn out to hold, which means that I procrastinated for no reason, but I can do little with that knowledge. This is how the anxious brain works.
This week I'm working on a project to measure spatially-weighted racial employment segregation. The core idea is simple. We've found that racial segregation between establishments (workplaces) has grown over the last generation, yet many of us would say that our work environments have grown more racially diverse. Can we square that circle? One possibility is that we do work in more diverse environments, but many of the people we encounter actually work for other employers. Think about outsourcing. I may see a cafeteria worker, janitor, or landscaper of a different race every day, but they needn't be a recorded employee of my employer. We interact, but on paper we are in different firms.
(This is all the more obvious now, having moved from the Bay Area to Montréal. Here, those positions are part of the same internal labor market. And lo, while their pay is below a professor's, they have many non-wage benefits similar to other university employees'. The tale of two janitors continues.)
I could get at this idea by using a spatially-weighted segregation measure, one that takes into account both which firms people work for and where those worklpaces are located relative to each other. A few papers lay out the mathematical machinery to do this, but almost none tie the pieces together. Doing so seemed like a fun idea and an important contribution. Hence this project.
I have more than 40 years of data from the Equal Employment Opportunity Commission, showing the racial composition of establishment workforces. I have geocoded that data, such that I have longitude and latitude for about 85 percent of them. In other words, I have maybe the ideal data with which to do this study. So why am I so anxious?
Partly it's banal. I have to figure out to implement a lot of the
necessary calculations. I wrote up a memo explaining how to do it,
which is how I learn it myself; but even if I understand how to do it in the abstract, I have to implement it in software. This means getting up to speed on a bunch of geospatial Python packages. At least nowadays there's geopandas, which leverages pandas to provide something like a sane data structure, but I don't know pandas that well...you get the idea. It's yak
shaving time again. But this isn't the real issue.
I have data for every large establishment. These are workplaces with at least 100 employees. Many contracted, external, or other workplaces might have fewer employees and thus not appear in these data. Instead I can look at the spatial proximity of large workplaces to each other, and get a sense from considering their spatial relations what the full population effect might look like. This means that this study will probably understate the effect of considering space (if there is one). By itself, this is not a fatal flaw. You'd rather have research designs that stack the deck against your finding an effect.
The root problem is this: what if large establishments are far enough apart that you can't get any meaningful overlap between them, unless you adopt an unrealistically large "reasonable" distance? I want to use something like a five-minute walk, or about a quarter of a mile. That will keep you on many corporate campuses, and keep you within a radius of one or two city blocks. Is that sufficient?
I couldn't know in advance. And this is what has been eating at me.
This is "research week" at Desautels. Those of us faculty who sign up are released from service work for the week (siiiiiiiiiiiiick) and given the space to make a big push on a particular project. I decided to work on this one. Partly because I need a week of concentration to learn some of the software and write the code, but mostly because daily meetings with my group would give me a commitment mechanism to see this through. It's always useful to find tricks that keep you working!
So what did I find?
Monday was spent building the dataset: merging the geocoded addresses (which we assigned dummy IDs to preserve confidentiality) back into the main data, and smaller, annual datasets that would be fast to analyze. Tuesday was spent turning those CSVs into ESRI shapefiles, installing a shapefile viewer and other relevant Python packages, and generally prepping things for the real calculations. Which meant that at some point yesterday, I could open QGIS and see this:
That's every large workplace for which I have geospatial information in 1971. I'm showing 1971 for two reasons. First, if things look OK nearly fifty years ago, when there's less development, they'll look OK today. Second, even though it would be incredibly hard from these dots alone to identify specific employers who filed an EEO-1 survey in that year (and in any case are legally required to do so), I have to keep employer identities confidential. Using data from 48 years ago makes it almost impossible that even the most brilliant sleuth would back out a current workplace from these plots.
The action comes when you zoom in. Here's metro Boston:
A lot of the circles overlap! Like, a lot of them. This strongly suggests that these large workplaces are close enough to one another to give some leverage to a spatial measure.
Similarly, here's San Francisco--again, 48 years ago:
You can easily see downtown and the financial district in the city's northeast. Those familiar with SF can also probably eyeball a few major streets, like Geary and 19th Avenue, from clusters of firms along them.
So far, so good. But these are big, dense cities. Is the overlap only meaningful there? I don't need the rural areas to be that dense--fewer people live and work there, so by definition they're less important for aggregate measures--but what about the suburbs?
Here's Carrollton, Texas:
Carrollton is a first-ring suburb of Dallas. It's next to Lewisville, where I went to junior-high and high school (Go Farmers, etc., etc.). More importantly, in 1971 it was what we'd call a shitheel burg: A suburbanizing farm town that had had large employers set up shop along the I-35 corridor. Back then, Carrollton had a bit more than 13,000 people; Lewisville had fewer than 10,000. Both of these cities have more than 100,000 residents today. If I can get some relevant spatial overlap in these places that long ago, I feel pretty good about the data overall, and thus am more relaxed about the project's potential.
(I can already hear people asking: If Carrollton was that podunk back then, why were there so many large employers (relative to size) there? Just to the southwest of the area clipped in that image is DFW Airport, well on its way to becoming one of the nation's busiest. The companies locating to Carrollton included firms like Halliburton, which counted aircraft supply among its operations, and where a great-uncle of mine would later work. He was a Teamster and machinist, smoked 2-3 packs of coffin nails a day, and surprised no one by dying of lung cancer. I still have a box of drill-press bits he gave me, more than 25 years ago. I digress.)
So what's the point?
This exercise falls under the broad topic of exploratory data analysis. When I was in graduate school, I sometimes got the sense that exploratory data analysis was what you did to come up with a research question, which today strikes me as really shady. A better way to think about exploratory analysis is getting to know your data. The scary assumptions I've been talking about are also sanity checks. It makes no sense to proceed if this assumption I've been talking about doesn't hold. No matter how elegant the math, if there isn't enough spatial overlap to investigate the issue, we're done.
It's tempting to ignore these sanity checks as long as possible. But that's perverse. Imagine you get no results, and eventually discover that one such assumption didn't hold. That's even more time you've wasted. Worse, imagine that you never thought to check this. Many are the studies I've seen with odd or nonsensical results where I think the researcher never spent time on basic investigation of their dataset.
But I don't want to focus too much on the negative. There's also a strong positive reason to do this work. Having seen that the data pass this smell test, I'm way more confident and excited about the project. That burst of relief and self-confidence will propel me through another few days' fumbling with code. Forward!
Not denying the data
Learning from what seem like mistakes.
A friend of mine and I recently discussed a comparison between how researchers research and how programmers program. The focus was on "finding effects" and "debugging." I excerpt his comment here:
Most programmers are poor debuggers who guess at fully-formed theories about why their programs are misbehaving (e.g., "could it be that, in module A, X is happening when Y?") and then try to prove them right, learning very little--or often nothing, because they forget the random little facts they have learned--each time they guess wrong. Truly good debuggers try to prioritize the experiments that can most easily yield the most information (e.g., "is there a quick way to establish whether the problem is in module A or module B?"). They are relatively rare.
I really like this point. Many social scientists have a theory and try to figure out why they're not finding the results that they "know" they should find.
This isn't necessarily a bad thing. Eric Leifer has a classic piece about Denying the Data (Gated, sigh), one that presages Judea Pearl's recent and very good The Book of Why. Their point is that data can tell you very little without a theory. (Pearl is excellent on how the correlational structure of observed data can never tell you about counterfactual outcomes, which are the building blocks of virtually all causal theories.) The irony though is that, even as many researchers espouse a sometimes-misplaced belief in listening to their data, they don't listen to it. People often skip any theoretical integration of the empirical "anomalies" they find in their data. Yet these are important! Indeed, these are what Thomas Kuhn pointed to as the fuel of new theories--of paradigm shifts, in his now-overused parlance.
I obsess on this point because one of my "tricks" or heuristics when doing research is to pay attention to how the limitations of data or empirical methods can subtly bias our theorizing. There's a good enough encapsulation of this idea in Dana Mackenzie's The Universe in Zero Words, which I re-read yesterday, that it's worth quoting at length. Mackenzie is discussing one of the roots of chaos theory, Edward Lorenz's attempted simulation of weather using a system of non-linear partial differential equations.
To recap for those just joining us: Lorenz ran his simulation one day in 1963, then had cause to run it again later on. When he ran it the second time, the (deterministic!) simulation produced wildly different results. Lorenz eventually realized that, when he programmed the second run of the simulation, he'd truncated the starting values for his parameters. The truncations were tiny--like, the sixth and seventh decimal places--and prevailing thinking held that they just didn't matter for the long-run behavior of the system. Yet his results said otherwise. Here was one of the first hallmarks of what is now often called chaos theory: extreme sensitivity to initial starting conditions. It is this idea, combined with fundamental limits on the precision of our measurement, that rules out things like long-run weather prediction. Now I'll hand the mic to Mackenzie:
The importance of Lorenz's paper was not immediately apparent; it was buried in a specialist journal, read only by meteorologists. However, the same process repeated in other disciplines. Michel Hénon, an astronomer, discovered chaos in the equations governing stellar orbits around a galaxy's center. David Ruelle, a physicist, along with mathematician Floris Takens, discovered strange attractors in turbulent fluid flow. Robert May, a biologist, discovered chaos in the simplest system of all: a one-variable equation that had been used for years to model populations competing for scarce resources.
Each of these pioneers was isolated at first, and they all faced disbelief from other scientists. A colleague of Lorenz, Willem Markus, recalled in James Gleick's bestselling book Chaos: Making a New Science what he told Lorenz about his equations: "Ed, we know--we know very well--that fluid convection doesn't do that at all."
This incredulity is perhaps a typical reaction to any paradigm-altering discovery. In the case of chaos there were specific reasons why mathematicians and other scientists had been so blind for so long. When mathematicians teach their students differential equations, they concentrate on the simplest, most understandable cases. First, they teach them to solve linear equations. Next, they might teach them about some simple two-variable systems, and show how the behavior of solutions near a fixed point can be described by linearizing. No matter what the number of variables, they will always concentrate on equations that can be solved explicitly: x(t) is given by an exact formula involving the time t.
All of these simplifying assumptions are perfectly understandable, especially the last one. Solving equations is what mathematicians do...or did, in the years BC (before chaos). And yet these assumptions are collectively a recipe for blindness. Chaos does not occur in a continuous-time system with less than three variables; and it does not occur in any system where you can write a formula for the solution.
It is as if mathematicians erected a "Danger! Keep out!" sign at all of the gates leading to chaos. Scientists from other disciplines--biologists, physicists, meteorologists--never went past the "Keep out!" signs, and so when they encountered chaos it was something utterly unfamiliar.
The bolded emphasis is mine. Notice how empirical simplifications become implicit and lead to false certainty. This happens all the time, and it is rarely appreciated. It is a process that blocks off research on all sorts of questions; yet it is a pernicious process precisely because it is so implicit. It is pernicious because what amounts to a barricade at the frontier of knowledge is not recognized as a barricade. Willem Markus did not tell Edward Lorenz that no one knew how fluid convection worked; he told him that they knew how it worked, and it didn't work the way Lorenz's model suggested. This, as Mackenzie writes, is a recipe for blindness.
I have encountered instances of this several times over the years.
Employer illegality and union elections
When I was in graduate school, I started looking into the effect of employers' intimidating or firing workers for union activity on the success of union organizing. The prevailing wisdom, based on two decades' analysis of union-election records, was that this had almost no effect: elections where there'd been a ULP charge against the employer were won by unions at about the same rate as elections where there hadn't been. This had led to an entire body of thinking that assumed workers' preferences were basically fixed, that employers couldn't affect them much, and therefore declines in success rates had mostly to be explained in terms of why workers didn't want to join unions as much anymore.
The problem with this thinking was that researchers had been using election files that were recorded and published by the National Labor Relations Board, but these files only recorded the elections that had actually taken place! Petitioners can withdraw their requests to hold elections. Petitioners are usually unions, and they often withdraw when they think they are going to lose. Once you recognize this, then before you even do the analysis you can understand why people had gotten the results they did: they were comparing elections where there'd been no employer shenanigans to elections where there had been, but where the union decided to go ahead with the election regardless. If union organizers are OK at gauging worker sentiment, these two groups should have similar success rates--but that doesn't tell you whether the employer's actions raised the withdrawal rate. Lo and behold, it did, massively. And once you allow for employer actions to affect organizing drives before the elections, the whole theory of fixed worker preferences falls apart.
Recombination and innovation
There's a theory in innovation research that inventions that combine more disparate ideas are more impactful, in terms of future inventions that build on them. This sort of "recombination" has been demonstrated with patents and their citation patterns to prior work, and it is widely celebrated. Yet there's another body of organizational research that suggests that diversified product offerings are harder for evaluators to make sense of and to legitimate.
This is what I call the Black Baseball Player Effect. If you were to compare major-league ball players in the 1950s, you'd find that black players out-performed white ones on almost every dimension. Does this mean that being black makes you a better ball player? Not necessarily. Black players had to clear a higher threshold to appear in the major leagues. You have to account for that selection bias. Indeed, as Chris Rock has joked, you can tell that baseball truly integrated by the late 1960s/early 1970s, because that's when we started to see mediocre black players!
I thought something similar might happen with inventions: "recombinant" inventions might be more impactful, but they might also be harder to get patents for, in which case the two effects would be conflated. I you jointly modeled application and citation, and adjusted for this differential selection, perhaps you'd find that the innovative impact of recombination was overstated. I discussed the idea with a colleague of mine who worked with patents. He told me that it was a neat idea but a non-starter--because "Virtually every patent application is approved." That is, there wouldn't be enough variation to do a meaningful analysis.
I took this to heart--he was the expert! and put the idea aside for more than a year. Later I had reason to start working with the European Patent Office's data with my colleague Gianluca Carnabuci. We quickly discovered that, while almost no patent applications were explicitly rejected, only about 70 percent were approved. The remainder were mostly patent applications that were "deemed withdrawn." These are applications where the patent examiner came back with a search report that (usually) suggested the applicants had ignored significant prior work, and either that the patent would not get granted or would have a much narrower scope (and thus be worth less). At this point, the applicants abandoned their application, and the application was "deemed withdrawn."
This sort of thing happens all the time. There are far fewer convictions for crimes than there are crimes committed, or even than there are arrests made. Most arrests and investigations end in pleas or settlements. But if what you care about is whether an invention is granted patent, surely these applications that are deemed withdrawn are relevant. Why had researchers ignored them?
For many years, the USPTO, whose data most researchers used, did not systematically record their failed applications. Thus the USPTO data file only had successful patents in it. People eventually worried about selection bias, but when they asked about the rejection rate, they found it was extremely low. Thus it seemed not to be a problem, and soon enough the belief took hold that "every application is approved."
For what it's worth, yes, differential selection does bias citation rates upward for recombinant inventions.
Racial segregation between workplaces
There's a classic argument, coming from Gary Becker, that more intense competition between firms should reduce employment segregation. You can think of segregation as a taste employers have, one that leads them to hire, not the best workers for positions, but the best workers of their preferred groups. This produces a less efficient organization, one that should be driven out of business when competition is stiff. The late and lamented Devah Pager had recently published a study that seemed to support this idea: employers who discriminated more in her audit studies were less likely to survive.
A couple years ago, Rem Koning and I thought we could look at this using the EEOC's employment-composition surveys. In a population of firms, there are two main ways you can integrate. First, individual firms can integrate over time. Second, less-integrated firms can be replaced by more-integrated ones. We knew that racial segregation between workplaces had declined over time; if the Becker hypothesis were true, then the within-firm component should have accounted for more of the decline in less-competitive industries.
After some searching for the appropriate metric, we settled on Theil's information statistic, and wrote Stata code to calculate between-firm racial segregation. We ran a scatter plot for yearly averages, and decided we'd written our code wrong--because it was going up.
Much investigation followed, but eventually we concluded that our code was right: racial segregation between workplaces was higher than it had been a generation earlier. How had no one noticed this?
Here again, the issue seemed to be with how data had been gathered. Most work on employment segregation leveraged individual surveys, like the GSS, CPS, ACS, and so on. In a survey, you cannot ask, "Alice, what is the racial breakdown of your employer's workforce?" and expect a good answer. But you can ask, "Alice, what do you do for a living?" Thus almost all large-scale research studied occupational segregation. And this has declined over time--we could reproduce everyone else's results with our data. But occupational segregation asks how predictive race is of what you do, conditional on where you work. It cannot tell you how predictive race is of where you work.
Now, the workplace of the typical white worker (including the typical white social scientist!) has diversified over the last generation. But the workforce has diversified much faster; hence the rising metrics. If you put an assumption of declining establishment-level segregation together with empirical data on declining occupational segregation, you have one of Mackenzie's recipes for blindness.
I hasten to add that we were blind, too--recall that we "knew" that racial segregation between workplaces had declined over time at the start of the project! The trick or heuristic, here as in the other cases, was not privileged knowledge on our part. It was a willingness to pay attention to how empirical anomalies in our data might alter our theories.
That was a lot of text
I don't have a whizz-bang conclusion to this post. There's a lot of argument from example here. Still I think this point is under-appreciated. Often, I am described as an empiricist, but I don't think that's quite right. I care a lot about theory, but theories are ultimately generated from empirics. It is in the frequently staccato interaction of the two that I find the most interesting work can be done.
A vow I made
If you care about the results either way, the moral quandary disappears.
This was originally written on 18 March 2019.
This has to do with a paper that I published in 2015 but that I started working on in late 2012 and substantively wrote in 2013: "The Control of Managerial Discretion: Evidence from Unionization's Impact on Employment Segregation." In many ways, that paper riffs off of work that had been published a few years earlier, work that investigated the effect of firms' diversity policies on their actual workforce composition. That work is both very good and limited: very good because it looked at longitudinal, within-firm changes rather than just changes across a population of firms; limited because the authors could not model and adjust for the self-selection of firms into adopting such policies. (All good research is limited in some way, but that's a topic for another post.) My idea was to look at union-representation elections, because unions impose many of the same constraints on arbitrary management that diversity policies do, and yet unionization isn't self-selected by the employer. It _is_ self-selected by the employees, but because they do so through election records, I could focus on very close elections and use a regression-discontinuity design to identify the treatment effect net of self-selection.
Let me pause and emphasize that I was expecting to find effects like that earlier work had found. I figured I'd show similar trends, but support them with cleaner causal identification. That'd be a real contribution!
I was by myself on Christmas Eve 2012, and I spent most of the day confirming I'd built the dataset correctly and then running the analysis. Indeed, when I looked at the full dataset, I found results that looked like those earlier studies!
...And the moment I started zooming in on the closer elections, all those results went away. When you adjusted for self-selection, there appeared to be no effects at all.
I spent much of that Christmas Day in something like panic. I had this design for a paper, I had this thing I was going to show, and the results were actually the opposite. I felt like I'd just wasted several weeks--or months, if you count the time of getting the data in the first place. Then my father and step-mother came into town, and I put the paper aside for a week while they visited.
Shortly after the new year, back at Stanford, I opened up the paper and looked at what I'd written. I'd written most of the front end of the paper already. That front end wasn't about why it was theoretically important that the control of managerial discretion improved workforce diversity. Rather, it discussed this theory, but then explained how the evidence for it had this weakness around self-selection. Then I explained my own research design, and how it would help with that problem.
At some point, reading this, the lightbulb went off. I'd gotten the opposite results from what I'd expected, but I didn't really have to change the front of the paper at all!
That this was a revelation to me, 3.5 years into my assistant professorship, says several things about how we were implicitly taught to do research.
- First, we were taught that good papers made a theoretical contribution. But a theoretical contribution was almost never couched as a contribution _to_ a theory, such as better evidence. Rather, it was couched as a new theory. It might build theoretically on existing work, but if it only built empirically on what was out there, it wasn't interesting. 
- Second, we were taught that replication studies were boring and uncreative. These were the mark of a workmanlike but probably uninspired student who couldn't come up with their own ideas. (This probably isn't obvious today, as the social sciences are roiled by the replication crisis, but when I was starting graduate school sixteen years ago, it was assumed that replication studies would replicate.) 
- Third, we were taught that "You can't learn anything from a null result," full stop. 
Today I disagree with all of these points, which I'll detail in a moment; but what's really striking here is that I'd learned about causal identification and research design since my first classes in graduate school. In my first research design class, we'd devoted a ton of time to the Fundamental Problem of Causal Inference. In labor economics, Carolyn Hoxby had drilled us on the Program Evaluation Problem and the Heckman/Lalonde debate. Chris Winship taught a whole class on "The New Causal Analysis," preparatory to overhauling the core sociology methods class. (Yes, I got my PhD at MIT, but I did a lot of coursework at Harvard Economics and Sociology.) We read Rubin on the potential-outcomes model, Pearl on directed acyclic graphs, Angrist on instrumental variables, Van der Klaauw on regression discontinuity...hell, it was a paper with a regression-discontinuity design that had prompted this freak-out!
This all points to something I almost never see talked about. Within organizational research, and in business schools more generally, we absorbed many of the arguments for and techniques of causal identification without necessarily updating our assumptions about how knowledge generation works. We had been educated in a framework that presumed routine theory generation and predictable empirical support for those theories. Causal identification was imported as a way strengthen that support, and maybe to raise the minimum bar for what would be considered support. But little else changed. And in most places, I think it still hasn't.
Consider those three points:
- First, today I think that there are many types of contributions that research can make. New evidence in support of or against an existing theory should be considered a theoretical contribution. After all, our faith in theories is not binary. It is, basically, Bayesian. Or at least it should be. 
- Second, replication studies are neither boring nor uncreative. Indeed, one of the reasons they are not boring is because often earlier studies cannot be replicated. But the term "replication study" is itself over-applied. Trying to test an existing theory with a new method, with better data, with cleaner identification--none of these things is rote replication. Such studies often involve considerable creativity of their own. 
- Third, it is correct that you cannot learn anything from a null result in an unidentified study with observational data. But of course you can learn from a null result in a well-designed experiment. Even a quasi- or natural experiment's null results can tell you something. The experimentum crucis, since its coining by Hooke and Newton, has been a core piece of the scientific method. The logic as it applies here is simple: if we have a theory that makes predictions, and if we agree in advance that a study design is adequate for testing those predictions, then a null result in that study should reduce our prior confidence in that theory. (I think it's canonical to reference the Michelson-Morley Experiments here.) 
Hence me, about a week after killing the results in that paper, coming to the realization that I had learned something. I'd reproduced the earlier results when I didn't control for self-selection, then killed them off when I did control for it. This shouldn't make me despair over the study; it should reduce my confidence in the theory.
This was the most liberating moment of my early career. I'd been socialized to have one of two responses at this point. Having assembled my data and found null results, I was either supposed to abandon the project (maybe put it in a file drawer) or continue my "exploratory analysis" until I found out why there wasn't an effect--in the process finding an effect elsewhere--and "reframe" the paper around that. (At this point someone may say that that wasn't what I was "supposed" to learn from my training. Maybe. But I'm a reasonably intelligent man and a good student, and this all seemed pretty unequivocally communicated to me. More, I've talked to enough of my colleagues to know that I wasn't the only one who imbibed these beliefs.) But, I now realized, I didn't have to do either of those things. I'd found a null result, one that contradicted earlier research, but I thought it was right. That null finding was a contribution in its own right. Yes, the paper would be harder to publish, probably, but that didn't matter. I'd found something I thought was real and should stand by it.
Which brings us to the vow. That day, I vowed never again to start a project unless I thought that its question was interesting however the answer shook out.
Perhaps this sounds banal. This after all is how science is "supposed" to work. But my experience is that it still hasn't really sunk in in my field. When I present null-results papers, for example, I still get suggestions of different ways to slice the data such that I'd be more likely to find an effect, which (it usually follows) will make the paper easier to publish. But we're not in this business just to publish papers, or for that matter to find effects. We're in this business to answer questions.
Down in the muck of data
In which I eventually find myself face to face with a punched-card typewriter.
This was originally written on 11 March 2019.
I've been back in the ULP dataset. I was working in it about three weeks ago and set it aside while I went down to Boston to work on other projects. I had some spare time this weekend, though, and decided to open it back up.
I'd foundered on cleaning up and encoding the counties that I talked about in [this post][0]. Both state and county names were saved using FIPS codes, but the data has plenty of typos. This shouldn't surprise us. All of these records were typed in by a person, and there are more than 440,000 of them. Even a 99.5% accuracy rate would leave more than 2,000 corrupted entries. Thus cleaning, which always takes the same dreary form:
- Tabulate the recorded values
- Flag any values that shouldn't appear, given your list of acceptable values
- Try to resolve what the anomalous value should have been, if possible
- Set any remaining anomalous values to missing
- Encode the sanitized list of characters into a factor variable
This is simplest with states, which are only supposed to have around 50 values. You know that 37 is a legitimate value because it's a valid FIPS code for a state (North Carolina, FWIW). By contrast, &7 is an error, because there is no FIPS code with an ampersand. And 3 is ambiguous--it could be 03, 30, or something else entirely. Finally, a number like 81 is outside the range of FIPS values. In all of these cases, I would set the state value of those records to missing. So, I tablute the state variable, find all of the typos, and do that:
stateStrikes <- c('  ', '& ', '&7', '0 ', '00', '0D', '2Y', '3 ', '4 ', '52',
                  '53', '54', '55', '5K', '61', '67', '70', '71', '72', '73',
                  '74', '79', '81', '87', '91', '92', '97', '99')
for (i in stateStrikes) {
    ulp <- ulp %>% mutate(state = na_if(state, i))
}Ditto for counties. There're a lot more counties, and county is a three-character string, so there's more room for error. But for many types of typo, at least, the process is the same:
countyStrikes <- c('   ', '  3', ' 09', ' 25', ' 27', ' 57', '-04', '-71',
                   '..2', '{35', '{39', '{79', '{EG', '{GE', '}00', '&  ',
                   '&&&', '0  ', '0 7', '0}1', '00 ', '00}', '00+', '000',
                   '01 ', '02}', '03 ', '04 ', '05 ', '05}', '06 ', '06}',
                   '08 ', '09 ', '0B9', '0X1', '1  ', '1 4', '1 5', '1?7',
                   '10-', '13 ', '13}', '15 ', '16 ', '16-', '17 ', '20 ',
                   '23}', '24}', '25}', '3  ', '3 1', '30{', '30}', '36{',
                   '37 ', '41 ', '7  ', '70}', '75}', '9  ', '90 ', '93}',
                   '96}', 'A13', 'A27', 'A45', 'C60', 'G50', 'J30', 'K40',
                   'K50', 'L40', 'Y11')
for (i in countyStrikes) {
    ulp <- ulp %>% mutate(county =
                              replace(county, county==i, NA))
}It's more complicated for counties because the list of invalid county FIPS codes varies by state. Thus 007 is invalid in Delaware (which only has three counties) but fine in New Jersey. Most county FIPS codes are odd, but when you add a county after the fact (I'm looking at you, La Paz County, Arizona!) or have an independent city (Baltimore, St. Louis, too much of Virginia), they follow different schemes. There's no real way around this other than to tabulate the values by state, go through each state, and flag unacceptable ones. This is what a laptop and a lot of documentaries that you can listen to more than watch are for...
Eventually I created a named list of lists, then used a nested loop to visit each state and set its individual list of invalid values to missing. The savvy among you will recognize this as a dictionary bodged into R:
countyNAs <- list(c('008', '066', '143', '145', '151', '155', '201', '295',
                 '300', '352', '510', '521', '745', '889'),
               c('006', '010', '020', '026', '030', '040', '046', '050',
                 '060', '063', '070', '077', '090', '110', '111', '113',
                 .
                 .
                 .
               c('040', '049', '079', '095', '141'))                 
names(countyNAs) <- c("Alabama", "Alaska", "Arizona", "Arkansas",
                      .
                      .
                      .
                      "West Virginia", "Wisconsin", "Wyoming")
for (i in names(countyNAs)) {
    for (j in countyNAs[[i]]) {
        ulp <- ulp %>% mutate(county = replace(county,
                                               state==i & county==j, NA))
    }
}So far, so good.
The limits of cleaning
Let's note right here how often judgment enters into this process. We don't talk about it enough. Most datasets have problems--typos, missing data, machine or language-encoding hiccups, shifts in coding schemes. The analyst has to make decisions about how to deal with these, and historically those decisions were almost never documented. To this day, no one wants to write up every single design choice they made. That is why it's important to share your data and code; but also that's why it's important to make lists like these, however tedious they seem. If I were to go through my data in Excel or the like, changing errors as I saw them, there would be no record of what I'd done, short of analyzing two datasets side by side. Even then, different rules could produce similar results in some cases. Don't just clean your data; whenever possible, clean your data with code.
Having finally cleaned up the counties and managed to encode them, I moved on to the complainant and respondent variables. Per the codebook, these are four-character strings:
Notice that there's no apparent translation for how to interpret those four characters! Elsewhere in this codebook they refer to another PDF that lists union names, of which I have a copy:
It seems that 036 corresponds to the Retail Clerks, per the
codebook's example, so that's good. But what does the 7 at the start mean? And sure, I can imagine they use 000 to mean employer--or does that mean employer association, while R means employer? And what code indicates an individual? 
Here, I'm running up against the limits of the available documentation. For my research, I mostly care whether it's the union or the employer filing the complaint, and I can probably back it out of these data, but it's worth noting that I cannot proceed from here without making some judgment calls. For the moment, I'm going to focus on the last three characters in these variables, and come back to the first ones.
Suspicious error patterns
With that in mind, I tabulated the last three characters of the
complainant variable, and started the same procedure as for states and counties. Eventually I compiled this list:
complainantStrikes <- c("   ", "  -", "  }", "  &", "  1", "  2", "  3",
                        "  {", " { ", "{  ", " } ", "}  ", " & ", "&  ",
                        "  0", " 0 ", "0  ", "  1", " 1 ", " 3 ", "3  ",
                        "  4", " 4 ", "4  ", "0{ ", "0} ",
                        "  5", "  9", "  {", "  &", " 02", " 03", " 05",
                        " 07", " 21", " 22", " 23", " 25", " 26", "  3",
                        " 30", " 33", " 36", "  4", " 41", " 42", " 43",
                        " 46", " 50", " 51", " 52", " 63", " 71", " 72",
                        " 78", " 79", " 83", " 87", "---", ".23", "{00",
                        "&&&", "  0", " 0 ", " 1 ", "0 &", "0 0", "0 1",
                        "0 2", "0 3", "0 5", "0 6", " 0{", "0{&", "0{0",
                        " 0}", "0}5", " 00", "00 ", "00-", "00{", "00}",
                        "00&", "00C", "01 ", " 01", "01N", "01O", "01R",
                        "02 ", "02O", "03 ", "03A", "03J", "03O", "05 ",
                        "05M", " 06", "06 ", "06J", "06L", "07K", "07N",
                        "08 ", " 08", "0A5", "0A6", "0B3", "0C2", "0C6",
                        "0F1", " 0L", "1  ", "1 1", "1 3", "1 5", "1{3",
                        "1{5", " 10", "10 ", "10+", "10J", "10L", "10N",
                        " 11", "11 ", " 12", "12 ", "12J", "12N", " 13",
                        "13 ", "13}", " 14", "14 ", "14J", "14O", " 15",
                        "15 ", " 16", "16 ", "16O", " 17", "17 ", "17J",
                        "17K", "17N", "17P", "17Q", "17R", " 18", "18 ",
                        "18M", " 19", "19 ", "1D1", "1E1", "1H4", "2 1",
                        "2 3", "2 4", "2 9", "2{4", " 20", "20 ", "20F",
                        "20J", " 21", "21 ", "21+", "21A", "21C", "21J",
                        "21L", " 22", "22 ", "22K", " 23", "23 ", "23C",
                        "23L", " 24", "24 ", "24O", "25 ", " 25", " 29",
                        "29 ", "29J", "29K", "29P", "2A9", "2C3", "2D6",
                        " 3 ", "3  ", "3 1", "3 3", "3 8", "3 9", "3--",
                        " 30", "30 ", "30L", "31 ", " 31", "31L", " 32",
                        "32 ", " 33", "33 ", "34K", " 35", "35 ", "35*",
                        "35N", " 36", "36 ", "37J", "3B2", "3D2", "  4",
                        " 4 ", "4  ", "4 1", "4 2", " 40", "40 ", " 41",
                        "41 ", " 45", "45 ", "45J", " 46", "46 ", "46L",
                        " 47", "47 ", " 48", "48 ", "4F3", " 5 ", "5  ",
                        "5 3", "5 9", " 54", "54 ", " 56", "56 ", "5D1",
                        "5F3", "  6", " 6 ", "6  ", "6 1", "60L", " 61",
                        "61 ", " 62", "62 ", " 63", "63 ", " 68", "68 ",
                        " 71", "71 ", "72G", " 75", "75 ", " 76", "76 ",
                        "  8", " 8 ", "8  ", "8 1", "8 2", " 83", "83 ",
                        " 9 ", "9  ", "9 6", " 99", "99 ", "V00")Yes, I'm aware how long that list is--I had to type it. But typing has its uses; you notice when you're hitting some keys more than others. For example, there are a lot of errors that include the first few letters of the alphabet. That might just be an encoding issue. But there are also a lot of errors that include keys from the right-hand side of the home row. In particular, the letters "J", "K", and "L" show up a lot more than you'd expect from random error, with "M", "N", "O", and "P" somewhere close behind.
That's weird, right? Erroneous "Q"s, "A"s, and "Z"s show up a lot, because they're right next to the Shift and Tab keys. "O" and "P" maybe, since they're below the "9" and "0" keys. But "J"? It's right in the middle of the letter keys.
Then it hit me: these weren't typed on a modern keyboard. These data come from punched cards, and those would have been typed on a specialized keypunch machine. I have _no_ idea what those keyboards look like. To Google!
I have reason to suspect that the NLRB and AFL-CIO were using IBM mainframes. Even if I didn't, those were the most common type. With the rise of System/360, the most common keypunch machine was the IBM 29:
If you squint, there are some telltale marks on that keyboard! Let's "enhance":
(By Maximilian Dörrbecker - Own work, CC BY-SA, 3.0, Link)
Bingo. The IBM 29 didn't have a separate numeral row. Instead there was a telephone-style numeric keypad on the right-hand side of the keyboard. Since punched cards didn't use upper- and lowercase, you had "number-shift" and "letter-shift" keys (the "numerisch" and "alpha" keys here). If you stuttered on a "4", you'd enter a "J", and so on. That accounts for a lot of these typos. It even suggests where they ampersands might be coming from--a missed "3"!
Delving deeper
This doesn't solve all problems, of course. It still isn't clear where all the "A"s, "B"s, and "C"s are coming from, nor what that first character in the complainant and respondent variables is supposed to represent. But whenever I find something like this--a connection between a seemingly random string of digits and the mechanical or human processes that generated it--I feel like I'm coming to understand my data better.
The next and biggest mystery, at least on this front, is where all the curly-brace characters are coming from! Bear in mind that, because these were punched cards from an old IBM mainframe, they were encoded in some variant of EBCDIC instead of ASCII, and EBCDIC didn't have curly-brace characters. This suggests some sort of encoding-translation error, but parsing that may be a bridge too far, even for me.
Matrix math
Linear algebra for fun and profit.
This was originally written on 7 March 2019.
On and off for about a year now, we've been working on a paper that tries to understand how changes to the US industrial structure have affected racial segregation between US workplaces. We think that decompositions of Theil's information statistic at the industry level, similar to what I talked about in the last post, give us a window into these changes. Details to follow in the paper we're writing.
We've had a problem, though. I wrote all of our original code for calculating Theil's \( H \) in Stata. I did this because I know Stata quite well, but it is terrible at handling recursion. This for example is the code I wrote for calculating a three-level decomposition. And it's just the code, with all comments and data-wrangling removed:
foreach race of global races {
    forvalues o = 1/9 {
        generate phi_aj\`o'_`race' = cond(w_aj`o' == 0, 0, `race'`o'/w_aj`o')
        generate E_aj`o'_`race' = cond(phi_aj`o'_`race' == 0, 0, ///
            phi_aj`o'_`race' * (ln(1/phi_aj`o'_`race')/ln(2)), 0)
    }
}
foreach race of global races {
    generate phi_aj_`race' = (`race'/w_aj)
    generate E_aj_`race' = cond(phi_aj_`race' == 0, 0, ///
        phi_aj_`race' * (ln(1/phi_aj_`race')/ln(2)), 0)
}
bysort year $area: egen w_a = total(w_aj)
foreach race of global races {
    by year $area: egen w_a_`race' = total(`race')
    generate phi_a_`race' = w_a_`race'/w_a
    generate E_a_`race' = cond(phi_a_`race' == 0, 0, ///
        phi_a_`race' * (ln(1/phi_a_`race')/ln(2)), 0)
}
egen a_tag = tag(year $area )
by year: egen w = total(w_a) if a_tag
by year $area: replace w = sum(w)
foreach race of global races {
    by year: egen w_`race' = total(w_a_`race') if a_tag
    generate phi_`race' = w_`race'/w if a_tag
    generate E_`race' = cond(phi_`race' == 0, 0, ///
        phi_`race' * (ln(1/phi_`race')/ln(2)), 0) if a_tag
}
foreach race of global races {
    forvalues o = 1/9 {
        drop phi_aj`o'_`race'
    }
    drop phi_aj_`race'
    drop phi_a_`race'
    drop phi_`race'
}
forvalues o = 1/9 {
    generate E_aj`o' = E_aj`o'_white + E_aj`o'_black + E_aj`o'_hispanic ///
        + E_aj`o'_other // occupation
}
generate E_aj = E_aj_white + E_aj_black + E_aj_hispanic + E_aj_other // est
generate E_a = E_a_white + E_a_black + E_a_hispanic + E_a_other // area
generate E = E_white + E_black + E_hispanic + E_other if a_tag // US
    by year $area: replace E = sum(E)
forvalues o = 1/9 {
    generate H_aj`o'_cmp = (w_aj`o'/w_aj) * ((E_aj - E_aj`o')/E_aj)
}
egen H_aj = rowtotal(H_aj1_cmp H_aj2_cmp H_aj3_cmp H_aj4_cmp H_aj5_cmp ///
    H_aj6_cmp H_aj7_cmp H_aj8_cmp H_aj9_cmp)
generate H_a_btw_cmp = (w_aj/w_a) * ((E_a - E_aj)/E_a)
by year $area: egen H_a_btw = total(H_a_btw_cmp)
generate H_a_win_cmp = ((w_aj*E_aj)/(w_a*E_a)) * H_aj
by year $area: egen H_a_win = total(H_a_win_cmp)
generate H_a = H_a_btw + H_a_win
generate H_btw_cmp = (w_a/w) * ((E-E_a)/E) if a_tag
by year: egen H_btw = total(H_btw_cmp) if a_tag
generate H_win_btw_cmp = ((w_a*E_a)/(w*E)) * H_a_btw if a_tag
generate H_win_win_cmp = ((w_a*E_a)/(w*E)) * H_a_win if a_tag
by year: egen H_win_btw = total(H_win_btw_cmp) if a_tag
by year: egen H_win_win = total(H_win_win_cmp) if a_tag
by year: generate H = H_btw + H_win_btw if a_tag
by year: generate HwithOcc = H_btw + H_win_btw ///
    + H_win_win if a_tag
by year $area: replace H = sum(H)
by year $area: replace HwithOcc = sum(HwithOcc)I think you'll agree with me that this is ridiculous. It doesn't help that Stata insists on calling its variables "macros" and has an extremely particular syntax for referencing them...or that it insists on holding one data set in memory at a time...or that its programming language has even thornier syntax for defining helper functions. Things like this are why I've been trying to learn R. I don't love R, but it has strengths that offset many of Stata's weaknesses, and it plays better with other software.
By comparison, at one point last year I wrote some python to build a hierarchical tree structure in which to hold data like these. Once I had that tree structure built, I could write helper functions and ultimately a function like this one:
def theil(tree, unit, levels):
    ''' Size-weighted sum of entropy deviations of a unit's sub-units. '''
    the_theil = 0
    if levels == 0:
        for subunit in tree.children(unit):
            the_theil += theil_cmp(tree, subunit.identifier)
    else:
        for subunit in tree.children(unit):
            the_theil += theil_cmp(tree, subunit.identifier)
            sub_weight = subunit.data.total() / tree[unit].data.total()
            the_theil += sub_weight * theil(tree, subunit.identifier,
                                            levels - 1)
    return the_theilHey, look, it calls itself recursively! The structure of the code reflects the structure of the computation! Obviously we should go this route!
This led to another problem, though. I can write code in python, but I can't necessarily write fast code in python. All of the functions I built to calculate the statistics were wicked fast, but the functions that actually built the data structure were unforgivably slow. A full set of analyses on our data, from building the structure through spitting numbers into a matrix, could take up to 40 minutes. Stata by comparison ran in less than 10.
That might feel OK. We're not writing production code, after all--we're academics! And code that's easy to modify, or generally to understand and work with after you've been away from it for a while, is really important. I learned this well in graduate school. The first time you get a revise-and-resubmit asking for further analyses, and you open up the project's code base and can't even remember which file calls which file, never mind what they each do...when you find yourself drawing a dependency graph for your own scripts, it's time to focus on writing clearer code.
All that said, though, 40 minutes was still unacceptable. In this project, we needed a way to distinguish what are basically random movements of the various components of the statistic from "real" shifts. There are no analytically defined standard errors for Theil's \( H \), though--what statistical papers exist on the topic basically amount to a flame war about how every prior attempt is garbage. The way to proceed under such conditions is to simulate a probability distribution with the data at hand, either through bootstrapping or via a permutation test. Yet these techniques require you to perform a given calculation not once but hundreds or thousands of times. If we were to go with 1,000 repetitions, our projected compute time shot from 40 minutes to more than three months...or more than three weeks if we used the thorny Stata code.
Once we realized this, we set the project aside for several months. This is reasonable enough when you have a tough problem and a busy job. In between, we literally collected and analyzed data for another paper, wrote a draft, workshopped it, and submitted it to a journal. So it's not like we were idle. But this one has been gnawing at me. Finally in January I wrote my co-author, Rem, and suggested that we book time at the end of February to sit down and finally figure this out.
Along the way, we decided to try this next in R. The python approach was fundamentally sound, but Rem's as fast in R as I am in Stata, and we could just write and trouble-shoot code faster that way. Plus, if we could speed things up in R, it would be easier to share with other academics.
Like all statistical languages, R is optimized for fast vectorized calculations. And so I wanted to work out as much of this as I could in those terms, ideally before I got to Cambridge. So, I had the idea to ignore programming languages entirely and to try to write everything as matrix math.
This took some reviewing for me. I've never taken a full linear algebra class, and my shakiness with parts of it, more than anything about statistics per se, was often what tripped me up in econometrics classes. But here it seemed the right tool for the job.
I ultimately wrote this document. When I'm working on research and not sure about my math, I'll often write documents like this one. I find that the rigor involved in typesetting math, however tedious, and then seeing my handwritten work typeset often helps me spot errors I've made. Much of it often ends up in an article's appendices, and it can be useful for future researchers to see how you did something, in all its detail.
This time, it really paid off. It's straightforward to write matrix math in R, and so translating from the document to code was comparatively painless:
p_agj <- (R %*% j)
p_ag <- (S_ag %*% (t(S_ag) %*% (R %*% j)))
p_a <- (S_a %*% (t(S_a) %*% (R %*% j)))
p <- (t(i) %*% (i %*% (R %*% j)))
w_agj <- p_agj / p_ag
w_ag <- p_ag / p_a
w_a <- p_a / p
phi_agj <- R / ((R %*% j) %*% t(j))
phi_ag <- (S_ag %*% (t(S_ag) %*% R)) / (S_ag %*% (t(S_ag) %*% ((R %*% j) %*% t(j))))
phi_a <- (S_a %*% (t(S_a) %*% R)) / (S_a %*% (t(S_a) %*% ((R %*% j) %*% t(j))))
phi <- (t(i) %*% (i %*% R)) / ((t(i) %*% ((i %*% R) %*% j)) %*% t(j))
E_agj <- -1 * ((phi_agj * tri(phi_agj)) %*% j)
E_ag <- -1 * ((phi_ag * tri(phi_ag)) %*% j)
E_a <- -1 * ((phi_a * tri(phi_a)) %*% j)
E <- -1 * ((phi * tri(phi)) %*% j)
d_agj <- E_agj / E_ag # This never gets used
d_ag <- E_ag / E_a
d_a <- E_a / E
d_agj <- ifelse(is.na(d_agj), 0, d_agj)
d_ag <- ifelse(is.na(d_ag), 0, d_ag)
d_a <- ifelse(is.na(d_a), 0, d_a)
h_agj <- w_agj * ((E_ag - E_agj) / E_ag)
h_ag <- S_ag %*% (t(S_ag) %*% h_agj)
h_a <- S_a %*% (t(S_a) %*% (w_ag * d_ag * h_ag))
h <- t(i) %*% (i %*% (w_a * (E - E_a) / E)) # Not used
h_agj <- ifelse(is.na(h_agj), 0, h_agj)
h_ag <- ifelse(is.na(h_ag), 0, h_ag)
h_a <- ifelse(is.na(h_a), 0, h_a)
h <- ifelse(is.na(h), 0, h) # Not used
H_g <- S_g %*% (t(S_g) %*% (w_a * d_a * w_ag * d_ag * h_agj))That isn't as pretty as the python code, but the parallels at different levels of calculation are much clearer, and eventually we'll wrap and collapse all of this into some recursively defined helper functions.
Two things to note here. The first is that, when we first tried to run this, lines like p_ag <- S_ag %*% t(S_ag) %*% R %*% j were causing our machines to hang and crash. We were getting ready to abandon this approach when it hit me: \( \mathbf{S_{ag}} \) is an \( N \times K \) matrix, and so \( \mathbf{S_{ag}S_{ag}}^T \) is an \( N \times N \) matrix. If, as here, \( N \gg 100,000 \), that thing can have tens or hundreds of billions of elements. Yet we don't need it! Our ultimate goal was an \( N \times 1 \) vector, so if we just changed the order of operations we could post-multiply the second term by the third and fourth, then pre-multiply that by the first. At no point did we actually need to create the gigantic matrix. 
We did that, and the operation went from crashing our machines to finishing in a fraction of a second.
The second was with how we generated \( \mathbf{S_{ag}} \) in the first place. We were using the fastdummies library in R. It looked like fastdummies wasn't fast, though. Generating these matrices took up to 30 seconds on our real dataset. After some searching, we found that we could generate sparse matrices with the matrix library...and what had been taking 30 seconds now took .3 seconds.
With changes like these, we sped up the computation 500 or more times. Our projected compute has fallen from more than three months to fewer than six hours.
I have two reasons for writing this all up. The first is that someone might find it helpful, both in its specifics and in the general sense of seeing how those of us with little formal training in computer science can still go about solving practical problems in computation. None of anything I did here is rocket science; it just takes persistence and rethinking how to approach the problem.
The second reason, of course, is that I figured something out in an area where I usually feel like a fraud, and I'm proud of it. It isn't a brag to share such moments with people. It's a chance to feel good about yourself, justifiably so, and have people validate those feelings. Given how often when doing research we feel rotten, it's important to mark the really triumphant moments, too.
 
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
            