Don Green: Threats and Analysis

Welcome, everyone, to day four. Congratulations on making it through the first few days. Before we start, as always, I'd like to recap what we've covered over the last three days. The way I'd like to do it this time is to get a show of hands from anyone whose group's randomization design has changed significantly since you did the power calculations yesterday. Okay, that's one person, and I know you represent an entire group. So is everyone else pretty okay: they've done their power calculations and things seem fine? And how many people have not yet done power calculations for their project? Okay, so maybe those numbers will change a little. So, can someone tell me some of the ingredients that go into power calculations? Anyone? Sample size? Okay, and that seems like the most important one, right? What other ingredients? Yes, effect size; okay, that's maybe the other most important one. The variance? Okay, that's pretty good. And how about if we have a clustered design? Intracluster correlation, right? All right, good. So now we're all primed to be thinking about sample size; we've all designed our ideal experiments; everything is working perfectly on paper; and today is the day we get to discuss how everything can go wrong. So it's my pleasure to introduce Don Green. Don became a J-PAL affiliate when we opened our North America office in 2013, but he's actually been very important to J-PAL for as long as J-PAL has been around. Why? Because he's been one of the loudest evangelists for RCTs in the US that I know of. He's not an economist: he works in a political science department, so that's a discipline that we haven't interacted with as much before. When we launched J-PAL North America, one of the very first case studies that we created... well, you all did the Learn to Read evaluation, the randomized evaluation case study based on the Read India program. One of the very first iterations of a case study that we had was based on the Get Out the Vote project, and Dan Levy actually covered that a little, as one of the examples of how results can vary a lot based on the method that you use. Dan gave us at J-PAL credit for having done that analysis, but actually that's not true: one of the original papers based on that study was a methods paper showing exactly this, an academic paper published in a journal about the different methods and how they give you different results, and that was Don Green. So the first time I interacted with him was about ten years ago, when I emailed him (I'm sure he doesn't remember) asking for the data, and he happily shared it, and we were able to reproduce all those different results. So let me hand it over to Don; maybe you can tell us a little about yourself and then we'll get started. Thank you so much, and thanks for the kind introduction. My name is Don Green. I'm a professor at Columbia University, where I've taught for the last five years. For the preceding 23 years I was at Yale University, where I directed, among other things, the Institution for Social and Policy Studies, and that's where we got field experiments going in the 1990s, at least in political science, reviving them, really, since they date back to the 1920s. Interestingly, not one single field experiment was published in a political science journal during the 1990s.
So it sounds like a thousand years ago, but things have really changed quite markedly since then. I've done quite a few field experiments, and Alan Gerber and I have summarized roughly ten years of lecture notes, compiled over the many years we've taught these courses, in our textbook, Field Experiments: Design, Analysis, and Interpretation. What I'm doing today is giving you the super-condensed version of what would ordinarily be about three weeks' worth of lectures. If you're scoring at home, you'll find today's material in chapters 5 and 6, which are on non-compliance, chapter 7, which is on attrition, and chapter 8, which is on spillovers. With that in mind, we'll do the super Reader's Digest condensed version today, but feel free to interrupt with comments or questions as they arise; we have lots and lots of material to cover, and lots of supplementary material if you need it. As was just indicated, I get the unpleasant task of telling you about all the things that can go wrong when designing and implementing field experiments. But I'll say, having probably screwed up more field experiments than anybody else, that there's a special satisfaction you take when you anticipate the things that can go wrong and design accordingly, and that's why the very best field experiments are typically done in multiple stages, where the early stages are pilot studies. For example, a study that I'll describe in a moment, from Uganda, was done in two rounds: we completed the pilot study in February of this year, and we're back in the field in a few weeks, actually, to do the bigger version of that study. I can't emphasize enough how important it is, if you're launching experiments, to be part of a research team that has had some experience in the field, that has done some dress rehearsals, especially if you're working with an NGO that doesn't have a lot of experience doing that sort of thing; and I'm very happy to talk about the practical details of that kind of collaboration. At any rate, all the kinds of catastrophes we'll be talking about today happen routinely in the field, so routinely, in fact, that the very first thing you do when designing an experiment is review the threats to inference that might arise as a result of implementation. My setup for today will be, in some sense, to back up a few days in this class and talk again about core assumptions. Maybe you're sick of hearing about them, but the theme of the Field Experiments textbook is: remember three things as you go into the field. Three core assumptions, not five hundred and thirty-three things, but just three, because you have to keep them in working memory as you make the kinds of real-time decisions that are inevitably required in a field setting, where you have to make big choices on the fly. Then, having designed things with those core assumptions in mind, you're going to have to analyze the data accordingly. So the theme of the book is, in some sense: develop a pre-analysis plan based on the design, and once the data come back, analyze them based on the principles embedded in the design. And then interpretation; well, we're going to talk about that today to a large extent,
because in some sense interpretation reflects the kinds of analytic errors that can easily arise when people get confused about what should be compared to what. So, to our three depressing topics: non-compliance, attrition, and spillover. I'll discuss all of them in detail, but to preview: non-compliance is when some members of the assigned treatment group do not in fact receive the treatment, or some members of the assigned control group receive the treatment inadvertently; the latter is sometimes called crossover. We'll talk about what you can and cannot do once your experiment encounters non-compliance, which is really routine in field experiments, and frankly fairly routine in other kinds of experiments as well. Attrition refers to whether or not you have outcomes for your subjects: sometimes your subjects wander off and you're unable to measure outcomes. This is especially true in cases where your subjects are no longer beholden to you after you've, for example, withheld treatment from them. In many of the school lotteries, people enrolled to be given, say, a voucher to send their kids to private school, and the outcome, for example in the Howell and Peterson studies, is a measure of student performance as gauged by a standardized test. Well, guess what: the students who fail to win the lottery are less likely to take the follow-up tests. They're not really thrilled with the researchers who didn't give them the winning ticket, and as a result they're less likely to provide outcome measures. That's a case where attrition potentially threatens the entire enterprise. Spillovers refer to the possibility that whether or not one person receives the treatment changes how others respond to the treatment or control condition to which they were assigned: to what extent do the treatments assigned to some people leak over to the controls nearby? We'll talk about five different kinds of spillover, associated with a broad array of substantive applications. And then, if we have time, we'll say a bit about generalizability. The whole field-experimental movement is really designed to shorten the distance between the experimental design on the one hand and the policy and behavioral recommendations on the other, and yet no field experiment is perfect. As you think about field experimentation, it's very important to think about four aspects of generalizability: who are the subjects, what are the treatments, in what context do the subjects receive the treatment, and what are the outcome measures? That's not specific to field experiments; it applies to social science more generally. So that's where we're going in the next hour and fifteen minutes. Just to give you a sense of the kinds of things I do: most of my work over the last 18 years has focused on some form of political participation or political persuasion, but I also occasionally publish in criminology, psychology, and other fields.
Lately my work has focused on mass media. For example, two recent experiments: one, done during the national elections in India in 2014, looked at the effects of radio advertisements, aired on All India Radio stations, designed to discourage vote selling in the days before the election; and the other, the study I mentioned a moment ago, is about messaging Ugandan villagers (I said Rwandan before; sorry, that's a separate study) on the topics of domestic violence, abortion, and teacher absenteeism. I'll describe the studies very briefly. In the India study there are 60 radio propagation areas of medium size, and half are randomly assigned to air our ads, which are basically a series of dramatized vignettes that poke fun at the absurdity of vote buying, trying to draw attention to the fact that these are not gifts from politicians: this is a way for them to feed at the public trough. They're trying to bribe you so that they get access to public money and can essentially recoup their investment. So that's the message, and outcomes are measured at the assembly-constituency level; these little triangles are actual vote outcomes. As you can see, this is a cluster design, because assembly constituencies are embedded within propagation zones: we're not assigning each assembly constituency, we're assigning each propagation zone. So it's a classic case where the kinds of things we discussed in previous lectures apply; we have intracluster correlation. We have some treatment assignments, some control assignments, and some so-called placebo assignments, because India's elections that year were held in nine phases, and in some cases we were airing ads after the election had already occurred. We weren't doing that deliberately; we were trying to air them to people who were eligible to vote and about to vote, but in some cases we couldn't help but beam our ads to people who had already voted, and that makes for a nice placebo test. Two outcomes for that study: one is voter turnout, and the other is vote share for putative vote-buying parties. The Uganda study takes place north of Kampala, in rural areas; this is the first round of a two-round RCT. Fifty-six villages were randomly assigned to different kinds of messages, about abortion, teacher absenteeism, or domestic violence, embedded in a series of filmed vignettes with Ugandan actors. There are no subtitles: the vignettes are all in the local language, which is spoken predominantly in this area. These are all dramatizations of social norms regarding these kinds of issues, and the randomization is done by embedding these vignettes, three on each issue, in the commercial breaks of a film festival. We invite people to a free film festival, the festival runs for four weekends, and in the commercial breaks we play one, or in some cases two, of the vignettes on different themes. It's a design we'll talk about more later, because we have an interesting case of non-compliance here: we're going to go back to the villages in six to eight weeks and interview villagers, but some of the villagers will not have gone to the film festival, and so we need a technology for dealing with non-compliance.
We assign the village to some kind of vignette, but an individual villager may or may not have seen it, so how do we work our way around that kind of non-compliance? We'll see the ways in which the analysis adjusts to that concern. And then there's also a spillover concern: to what extent are the people who were not directly exposed influenced indirectly, through conversations they might have had with people in the village who did see it? Okay, so that gives you a sense of what I'm up to: lots of media stuff lately. So, as the joke goes in the Ivy League, there's a distinction often made between books you've read and books you've read yourself, and this can be one of the books you've "read." The idea of the Field Experiments book is to stress the importance of defining the estimand, which in this case will be the average treatment effect, though it can be other things as well; appreciating the core assumptions that guarantee unbiased inference, or without which you won't have unbiased inference; conducting the data analysis in a way that's consistent with the randomization procedure; following, wherever possible, procedures that limit the analyst's discretion, trying to tie your hands so that unconscious or conscious biases don't creep in; and then presenting the results in a transparent manner. I think that's a useful principle in general, but it's especially useful among field experimenters, because very often we have an enduring relationship with our NGO partners, and to the extent that the relationship is as transparent and open as possible, so much the better. Showing them a set of mock tables ahead of time and saying, "this is what we will say about those tables when the data come back," is not a bad idea, and many of my favorite colleagues do that sort of thing. Okay, so we're going to jump into the meat of the lecture now, and I'm simply going to review some basics of what you know. In Cambridge this is sometimes known as the Rubin causal model, but others point out that the model dates back to the 1920s, so it's also known as the potential outcomes model. The idea of a potential outcomes model of causality is to ask, for each experimental subject: how would that subject respond if treated? How would that same subject respond if not treated? Those are called potential outcomes because they have yet to materialize, and only one will materialize: the person will either be treated or not. We can't observe both, so although we want to know what the causal effect is, we can never observe it for any given individual. What we're going to try to do instead is get what I'll define as the average treatment effect: the average, over the entire subject pool, of the differences between treated potential outcomes and untreated potential outcomes. That's the extent to which the subject pool would shift, on average, if it went from a situation where nobody was treated to a situation where everybody was treated. So the difference between the two potential outcomes for a given subject is known as the unit-level treatment effect, and the average for the entire subject pool is called the average treatment effect. When people talk about the average treatment effect, or the average causal effect, as the target estimand, that's what they have in mind.
And just to fix ideas, let's imagine a hypothetical schedule of potential outcomes. Now, it's hypothetical because you never actually get to observe both potential outcomes for a given subject, but consider seven of our Ugandan villages and a measure of policy support, say related to domestic violence. One column shows what would happen if treated; the other shows what would happen if untreated. For example, in the first village the untreated potential outcome is 10, and it rises to 15 if treated, so the unit-level causal effect is 5. Notice that in some cases the effects are zero, in some cases they're negative, and in many cases they're positive. If we add up the column of unit-level effects and divide by 7, we get 5. Another way to think about it: if we take all the treated potential outcomes and average them, we get 20; for all the untreated potential outcomes, we get 15; and the difference between them is 5. So there are two ways to think about the average treatment effect: it's the average of the unit-level treatment effects, or, equivalently, it's the extent to which the world changes if the entire subject pool goes from untreated to treated; from 15 to 20 is a gain of 5. Why are we worrying about this? Because randomization works by drawing, effectively, a random sample of the untreated potential outcomes and a random sample of the treated potential outcomes and comparing the two. We can't know the individual-level causal effect, but we can estimate, in an unbiased manner under core assumptions, the average treatment effect.
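To make that concrete, here is a minimal sketch in Python. Only the first village's values (10 untreated, 15 treated) and the two group averages (15 and 20) come from the lecture; the other six villages' numbers are invented so the averages match.

```python
import numpy as np

# Hypothetical schedule of potential outcomes for 7 villages.
# Village 1 (10 -> 15) is from the lecture; the rest are made up
# so that mean(y0) = 15 and mean(y1) = 20, as in the example.
y0 = np.array([10, 15, 20, 20, 10, 15, 15])  # untreated potential outcomes
y1 = np.array([15, 15, 30, 15, 20, 25, 20])  # treated potential outcomes

unit_effects = y1 - y0           # unit-level treatment effects
print(unit_effects)              # some zero, some negative, many positive
print(unit_effects.mean())       # 5.0, the average treatment effect
print(y1.mean() - y0.mean())     # 5.0 again: 20 - 15
```

Of course, in a real experiment you only ever observe one column per village; the schedule is bookkeeping for what randomization lets you estimate.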
So, the core assumptions, then. Drumroll: the first is random assignment of subjects to treatments, because then we've broken any statistical relationship between whether somebody gets the treatment and their potential outcomes. The treatments are allocated randomly, so there's no necessary tendency for people to get the treatment when their potential outcomes are especially high, or to miss it when they're low. That's the difference between a randomized study and any study that involves self-selection, and, to preface what we're going to say about non-compliance, it's why, in everything we do, we compare only randomly assigned groups. You must resist the temptation to compare the group that actually takes the treatment to the group that doesn't. Human frailty being what it is, that's going to be the main way people mess up the analysis of these experiments, and it's the way they mess up the design too: not understanding that everything comes down to randomly assigned groups, never groups defined by whether they take the treatment, because those are not necessarily randomly assigned. The next core assumption is non-interference: the idea that a person's potential outcomes are stable regardless of who else in the subject pool happens to get the treatment. In other words, a person's potential outcomes are unaffected by how the randomization happens to come out. If that's not the case, you have an interference problem, and we'll have to deal with it through design and analysis. And finally, very importantly, the assumption of excludability says that subjects' potential outcomes respond only to the treatment itself and nothing else, nothing that happens to be incidentally correlated with the treatment. This is sometimes known as a symmetry assumption, and I can't emphasize enough how important it is to maintain symmetry in everything you do, by way of both design and analysis. If you have enumerators gathering outcomes, you have to make sure it's the same set of enumerators for treatment and control; if the men go to the treatment villages as enumerators and the women to the control villages, that introduces a potential violation of symmetry. Don't do different things to measure treatment and control; don't do things that would add an ancillary treatment beyond what you hope will be the treatment. Define your treatment, and then do not introduce extraneous treatments that are correlated with it; that's the basic principle, and it's why it's important to do as much blinding as possible. In the Ugandan case, for example, we spent hours and hours training our enumerators, but we tried not to tell them what our hypotheses were. They don't know about the films; they just know about the survey, because if they knew about the films, maybe that would introduce an unconscious bias in the way they ask the questions. Okay, so those are the three core assumptions. Absent from the list: nothing about normality; we haven't assumed a normal response distribution. [Audience question about vaccination and herd immunity.] Not necessarily; well, I guess it all depends on what your unit of experimental analysis is. If you're assigning, say, counties to a vaccination regime, it could be one thing; if you're assigning individuals, it could be another. One way to think about it is: is a person affected by whether other people receive the treatment or not? If so, there's a third potential outcome; between simple treatment and non-treatment there's this kind of murky spillover treatment, and we'd have to model that, building it into our schedule of potential outcomes. The herd immunity case is complicated, because one way of characterizing it is to say, oh, that's a background attribute in place before our experiment begins, applicable both to those who get the vaccine and those who don't. But another way to think about it is that the treatment effect depends on how many others around you have gotten vaccinations, and that would be a spillover problem; in some ways it's like a varying-dosage problem as well as a spillover problem. That's why this is sometimes called the stable unit treatment value assumption, but we don't use that term in our book, because it doesn't really mean anything in English; the idea comes through in other discussions in the book. Okay, so we're not assuming normality, and we're not assuming that we can generalize from our inferences about the average treatment effect in a subject pool; whether we can generalize is going to depend on other assumptions. For that reason, we're not necessarily implying that we're drawing a random sample of subjects and then breaking them into treatment and control, and that's why, although I noticed that in the warm-up discussion we were using the terms "sample" and "sample size," I try to avoid those terms, especially in the book, where we talk about a subject pool. They're not necessarily a sample of anything; they're in some sense our population, and we want to make inferences about them. We'd perhaps like to make inferences to some broader set of concerns, but I tend not to use the term "sample" unless I'm literally sampling.
Okay, so that's that. And I noticed that the new regime here at J-PAL is to ask a whole bunch of clicker questions during these slides, so that's what we'll do now. "A non-random sample leads to biased estimates of the average treatment effect": true, false, or don't know? We're looking for 20 responses; there we go, they're mounting up on the slide. Okay, now that we're almost there, what do I click to show the results? Advance? There we go. Well, there seems to be a fairly high proportion of people saying true. Somebody who said true, say why? This could be a case of non-response. All right, it seems to be the majority answer, and probably for good reason. "A non-random sample leads to biased estimates of the average treatment effect": well, it's kind of a confusing question because it's about a sample, but I'd say what this question probably means is non-random assignment rather than a non-random sample. I didn't write the question, so I'll say: if it means non-random assignment, then it doesn't necessarily lead to bias, but it leads to a threat of bias, right? It undercuts one of the core assumptions. So we'll make a note to revise the question for next time: we won't use the term "sample," and we'd probably talk about assignment rather than sampling, so, non-random assignment of treatment. All right. A key result, as you saw in preceding lectures, is that when the core assumptions are met, all three of them, you get unbiased estimates of the average treatment effect by comparing the average outcome in the treatment group to the average outcome in the control group. One way to see that: think about all the ways you could possibly randomly assign seven villages into treatment and control. Say our design puts two of them in treatment and five in control; there are 21 possible random assignments. If we computed the difference in average outcomes, using that hypothetical schedule of potential outcomes, for all 21 assignments and took the average result, it would be 5, the average treatment effect. We'd get 5 on average: some of our guesses would be too high and some too low, but on average we'd get the right answer. And notice that's why people say things like "experiments are fallible, but in expectation they give you the right answer." The key phrase is "in expectation," or "on average over all hypothetical replications under identical circumstances." Not that any one experiment is going to get it dead on, but on average you'll get it right, which is why it's so important to replicate experiments.
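Here is a small sketch of that claim, continuing with the hypothetical schedule from above (again, only the averages come from the lecture): it enumerates all 21 ways of assigning 2 of the 7 villages to treatment and shows that the difference-in-means estimates average out to exactly 5, even though any single estimate can miss badly.

```python
from itertools import combinations

import numpy as np

y0 = np.array([10, 15, 20, 20, 10, 15, 15])  # untreated potential outcomes
y1 = np.array([15, 15, 30, 15, 20, 25, 20])  # treated potential outcomes

estimates = []
for treated in combinations(range(7), 2):     # all 21 possible assignments
    control = [i for i in range(7) if i not in treated]
    # Under each assignment we "observe" y1 for treated, y0 for control.
    estimates.append(y1[list(treated)].mean() - y0[control].mean())

print(len(estimates))                  # 21
print(np.mean(estimates))              # 5.0: unbiased in expectation
print(min(estimates), max(estimates))  # but single estimates vary widely
```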
Oops, here comes another question. Okay: "Random assignment will always give you the true average treatment effect." All right, let's see; people say false. Why false? Because random assignment is just one of three core assumptions, and besides, we're going to see that random assignment can be undone, in particular through attrition. We can randomize people into treatment and control, and then they can un-randomize themselves right back out, and when we analyze the data we'll no longer have unbiased estimates. Or we can have spillovers, or a breakdown in symmetry; there are other things that can go wrong, and goodness knows they do go wrong. So let's consider the first of the mishaps: non-compliance, a source of so much confusion. Sometimes there's a disjunction between the treatment that's assigned by the experimenter and the treatment that's received, and sometimes this happens because of administrative mishaps: you send your radio ads to different radio stations, and sometimes they just don't play them. They meant to play them, but they didn't. This is actually something I often stress when talking about experiments: many of the things that go wrong in field experimentation have nothing to do with the fact that they are experiments. The world is filled with bureaucratic incompetence; things don't get done, and it doesn't have to do with the fact that you were doing things on a randomized basis. Things just don't happen as you intend, which is why it's so important to stay right on top of the implementation of your interventions. In the case of our Ugandan experiment, there are people watching to see that those films are being shown, and taking photographs to chart the number of people who show up. Interestingly, they are not white people, because that would change the context within which people view the films and make the whole thing less unobtrusive. But suffice it to say that they're there to prevent administrative mishaps. Another source of non-compliance, characteristic of much of my work contacting voters at their door or by phone, is that you knock on someone's door to try to give them the treatment, some encouragement to vote, and they don't answer. Maybe they've moved, maybe they're dead, maybe they see you and don't want to open the door; for whatever reason they're not there, often despite repeated attempts. It's not as though those are bad subjects or good subjects, high-turnout or low-turnout subjects; it's just that they're going to be subjects about whom we're not going to be able to learn very much, and that's why we're going to have to analyze the data in a way that zeroes in on a somewhat different estimand than the average treatment effect. Another kind of design is to encourage people to do things without forcing them. You can encourage people, for example, to go to a job training program, but you can't literally force them: you invite them, you remind them, you cajole them, and still they don't show up; or some of the people you did not invite, in the control group, show up enthusiastically, and you couldn't necessarily prevent them from taking the treatment. So in that case you have an encouragement design, but there's going to be slippage, not because of some failing on the part of the experimenter, but just because that's the way life is: people are free to make choices of their own, and encouragement will sometimes work and sometimes not. So it's extremely important, when discussing the whole idea of non-compliance, to keep excludability in mind, and you'll see why as we work through some very basic algebra of our approach to identifying the causal parameter of interest. Excludability is going to mean that the only thing that affects people is the treatment itself, not some ancillary thing correlated with random assignment. Okay, so if it's a job training program, it's: what is the effect of actually going to the program?
There can be backdoor effects, there can be other paths: for example, somebody could be affected just by being invited to the job training program. That's a function of random assignment, but we're going to stipulate that the only thing that can affect them is whether they actually show up to the program, so it's very important to keep in mind that that's a potentially fallible assumption. You're going to have to defend it if you conduct an experiment with non-compliance and try to estimate what we're about to define. And to the extent that you can design things so as not to tip people off to what their assignment is, so much the better. Sometimes in the developing world people allocate development interventions through a public lottery, which has lots of advantages, right? Everybody knows it's fair; everybody can see it. But now the treatment group sees that it's won and the control group sees that it's lost, and if that has an effect on outcomes, then we have potential outcomes that are distorted a bit by things other than our intended treatment. So how should we proceed? This is the canonical J-PAL animation. In our treatment group, imagine we have participants and no-shows to, say, a training program. In the control group we have non-participants, right, they weren't invited; but we also have what J-PAL folks call crossovers: people who show up even uninvited. What can you do? Can you switch them? No sirree, do not do that; that would lead to bias. Another approach is to drop them, and that's also a biased approach. Those are two quite common approaches, especially as one leaves the ranks of people who are familiar with this kind of problem among field experimenters. I'd say the typical person who's new to field experimentation, even people with quite distinguished careers in experimentation, often gets confused when encountering non-compliance. For example, people who study mass media, or say new media, will be interested in the effects of visiting a particular website. Well, they can't force somebody to go to a website, but they can encourage them to go. And it's so hard to lay off, so hard to be an experimenter analyzing those data and not compare the people who went to the website to the people who didn't, but that is not what you want to do, right? You only want to compare the randomly assigned groups. That's the only thing you do: focus on the assigned treatment group and the assigned control group, and make adjustments for the non-compliance in the analysis; don't change what you're comparing. Okay? So, very important: subjects whom you failed to treat are not part of the control group. Before we did our first experiments on voter mobilization in the '90s, I think pretty much every experiment that encountered non-compliance did some version of what I just told you not to do: they either discarded the untreated members of the treatment group or lumped them in with the control group. So do not throw out subjects who failed to comply, and don't reorganize subjects according to whether they took the treatment. Base your analysis only on the originally assigned groups, because those are the groups with comparable potential outcomes, thanks to random assignment.
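A quick simulation, not from the lecture and with made-up parameters, shows why. Compliers and never-takers are given different baseline outcomes, so comparisons organized around who actually took the treatment are biased, while the comparison of the originally assigned groups (the intent-to-treat comparison) recovers the right answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameters: 60% compliers, one-sided non-compliance,
# never-takers have a higher baseline (0.5 vs 0.3), true complier effect +0.1.
complier = rng.random(n) < 0.6
z = rng.random(n) < 0.5            # random assignment to treatment
d = z & complier                   # treated only if assigned AND a complier
y = np.where(complier, 0.3, 0.5) + 0.1 * d + rng.normal(0, 0.2, n)

itt = y[z].mean() - y[~z].mean()          # assigned groups only: unbiased
as_treated = y[d].mean() - y[~d].mean()   # treated vs untreated: biased
dropped = y[z & d].mean() - y[~z].mean()  # drop untreated from treatment arm: biased

print("true ITT (0.6 x 0.1) = 0.060")
print(f"ITT estimate           = {itt:.3f}")         # about 0.06
print(f"as-treated estimate    = {as_treated:.3f}")  # about -0.01, wrong sign
print(f"dropping non-compliers = {dropped:.3f}")     # about 0.02, badly biased
```

The bias has nothing to do with the treatment effect itself; it comes entirely from the compliers' baseline differing from the never-takers'.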
So, oops, time for another little quiz question. Your treatment group for analysis is: (a) individuals assigned to treatment who were actually treated, (b) all individuals who were actually treated, or (c) individuals assigned to treatment, regardless of whether they were treated? Clicker time; or "don't know," which is an honest answer, perfectly fine. All right, let's see what we got. All right, people are pretty clear on that one. That's correct: you want to have the assigned treatment groups in mind at all times; don't drop or reclassify people. Okay, so now we've blown through what is basically a lecture and a half when I teach this in the 13-week course version. Now we're going to do one little bit of algebra, but don't worry: we're only going to use the mathematical operations of subtraction, followed by division. There will be no calculus, no higher-order math, just subtraction, basically, with a little division. But it is very important to keep the key definitions in mind. In this setup, by the way, those of you who have taken more advanced statistics classes will find that this is basically a very, very simple derivation of what you'll recognize as the instrumental variables estimator; those of you who haven't, don't worry, you'll get the same intuition. The basic setup is that we're going to divide our subject pool into two latent, unobserved groups, and we begin by noting that, due to random assignment, the composition of the subject pool is the same in the treatment group as in the control group. Right? They differ only randomly; in expectation they should be identical. So what are those two groups? Well, consider one-sided non-compliance: in other words, we're knocking on people's doors, and sometimes we reach them and sometimes we don't, but the people whom we don't reach can't get the treatment. Some of the treatment group is untreated, but nobody in the control group gets the treatment inadvertently; that's one-sided non-compliance. Then we can define the latent groups to be compliers and never-takers. Compliers are people who would receive the treatment if, and only if, they were assigned to the treatment group. It's important to use "would" in the definition, because these are potential outcomes, okay? There's a potential-outcome profile: if you assign them to the treatment group, they take the treatment; if you assign them to the control group, they don't. Those are compliers. The never-takers, whom I'll define in a moment, are people who would never take the treatment, regardless of whether they're assigned to treatment or control. If you knock on their door, they don't answer, and if you don't knock, they don't answer either; whatever you do, they're not going to answer, and so it's going to be impossible to learn about those people, because they never take your treatment. So here's what we're going to do: we're going to model the expected outcomes of the assigned treatment group and the assigned control group as weighted averages of the outcomes in these two groups, compliers and never-takers; then we're going to make the assumption of excludability; and then we're going to be able to set up an empirical mapping from our design to our data.
We're going to identify the average treatment effect among compliers, and that will be our target, okay? So we're not going to learn about the average treatment effect for the entire subject pool when we encounter one-sided non-compliance; we're just going to learn about compliers, and only if we're willing to assume excludability. That's the roadmap of what we're going to do now. Okay, just to make sure everybody's clear on the definitions: compliers are those subjects who would take the treatment if and only if they're assigned to the treatment group. Having graded a zillion exams, I can tell you a common error: "a complier is a person who takes the treatment." Well, it's true that in a case of one-sided non-compliance, those who take the treatment are compliers, but that doesn't define a complier. A complier is a person who takes the treatment only if assigned to the treatment group, okay? And remember, there are compliers in the control group too, who are not treated. So keep in mind that a complier is a person with that configuration of potential outcomes: they take the treatment if assigned to the treatment group, not if assigned to the control group. And a never-taker is a person who never takes the treatment, regardless of their assignment. Those are their potential responses. Another quiz: compliers are (a) individuals who always take up the treatment, (b) individuals who never take up the treatment, (c) individuals who would take up the treatment only if assigned to the treatment group, or (d) don't know? Responses are mounting, they're coming in; and yes, that's right: individuals who would take up the treatment only if assigned to the treatment group. Compliers are not those who always take up the treatment; they take it up only if they're assigned to the treatment group, okay? So here's an empirical example based on this Uganda study I keep referring to. Last fall we assigned 56 villages to a series of films, and some had embedded videos about helping those who suffer from the medical complications of abortion. Abortion is not strictly illegal in Uganda, but it's not legal either; many people think it's illegal, and it's in a kind of state of limbo. Many girls and women suffer very, very negative health consequences, sometimes including death, resulting from botched illegal abortions: they go to somebody, have an abortion, and then suffer complications, usually infection. That's what's dramatized in the vignette, and the intended treatment is exposure to this message. But unfortunately, as I mentioned, not the whole village shows up; only some of the village shows up. So we have non-compliers: people who live in a village that was supposed to get our video but who haven't seen our video. Compliers are those who would be exposed to the abortion treatment if their village were assigned to it; never-takers are those who would not be exposed to it regardless of their village's assignment. And the outcome here is just one of many survey questions, asking about your willingness to help a girl who is ostracized because of an abortion.
So I'll show you the question in just a minute, but first I want to set up the model, which is more general than this particular example and can be used for a wide array of applications involving one-sided non-compliance. Here's our setup. It's often said that when you have a field experiment and you encounter non-compliance, those who take up the treatment could be very different from those who don't, and that's right; but we're not going to be comparing those people, okay? It's very important, when you encounter criticism of an experiment with one-sided non-compliance, to keep your head straight and say: yes, that would be a source of bias, but I'm not going to be doing that, and it doesn't mean the entire study is worthless; I just need to develop an approach that stands up despite that concern. So I'm going to develop a very, very simple model. It has only four little parameters, and I'm going to show you that it allows us to identify the average treatment effect among compliers without assuming that compliers and never-takers are similar in terms of what they do. That's going to be the key thing: we're going to be agnostic about whether compliers and never-takers are similar, and yet we'll still be able to identify the treatment effect for compliers. So let Pc be the probability that an untreated complier expresses empathy for those ostracized because of abortion; in other words, this is just the base rate for a complier without treatment. Then Pn is the same base rate among never-takers. Pc and Pn could be the same or different; we're agnostic about that. Then a is the proportion of compliers in the subject pool. That's going to be a function of the design: if you try really, really hard to get people to show up to your films, you'll have a higher rate of compliance, and if you don't, you'll have a lower rate; it's context dependent, but one way or the other you'll have some share of compliers. And the thing we really want to know about is the average treatment effect among compliers, which we're going to call T. So here comes the model. We're going to model the outcomes in our assigned control group and our assigned treatment group; not the people who actually take the treatment, just the assigned groups, okay? Remember that those are similar in composition due to random assignment, because they're just randomly assigned groups. So how will they differ? Let's work through the math. The expected outcome in the assigned control group is a weighted average of outcomes among compliers and never-takers: the complier average times the share of compliers, plus the never-taker average times the share of never-takers, and because there are only two groups, that share is 1 minus the share of compliers. In symbols, E0 = a*Pc + (1 - a)*Pn. That's just a simple weighted average; there's nothing complicated about it. The overall average is a blend of the complier average and the never-taker average. So what's different about the assigned treatment group?
What's different is not the never-takers, because they never take the treatment. What's different is the compliers: the compliers' base rate is bumped up by the average treatment effect for compliers, and that's where T comes in. In symbols, E1 = a*(Pc + T) + (1 - a)*Pn. So again, the overall average in the assigned treatment group is a weighted average across compliers and never-takers, but now the compliers have been perturbed, in some sense, by a treatment effect; that's what makes them different. And now that you know that's what makes them different, look at these two equations: if you subtract E0 from E1, you get a*T. (Oops, we said no Greek, and there was Greek in there; my bad. The alpha from the previous slide was supposed to be an a; there was a strict no-Greek rule, so we'll change the alphas to a's.) So E1 minus E0, through the miracle of subtraction, gives you a times T: the average treatment effect among compliers, T, times a, the share of compliers. Under the core assumptions, that reveals some interesting quantities. That overall quantity, E1 - E0, is called the intent-to-treat effect, or the ITT. You may have heard about this in reading about evaluations in development economics or other areas. The intent-to-treat effect, sometimes called the intention-to-treat effect, is the effect of assignment: you turn a blind eye to whether people took your treatment, or whether there were backdoor paths from assignment to outcomes, and you just ask, what's the effect of my assignment? What's the effect of telling canvassers to knock on doors, irrespective of whether they actually talk to anybody; just the effect of having that program? And sometimes that's what NGOs want to know. That's sometimes what firms want to know: they say, I don't really want to know the behavioral effect among compliers; I just want to know what it did to outcomes. How many more votes were produced? How many more sales were generated? So the ITT is often a meaningful estimand: it evaluates the success of the overall program in shifting views. But sometimes, especially among academics, the focus is on the behavioral effect: the average treatment effect among compliers. Why might the ITT be big or small? Well, one way it could be very small is that almost nobody was treated. The treatment could have a big effect, but if almost nobody was treated, the overall effect, the net effect of your program, could be negligible. So academics typically want to separate those two components of the ITT into distinct parts. And how would we estimate T? Well, we'll take the empirical average of the outcomes in the assigned treatment group and the empirical average of the outcomes in the assigned control group, and we'll divide their difference by the proportion of people we actually treat in the assigned treatment group. Right: we knock on doors in the treatment group, and a certain share of people say "hello" and the conversation starts; and because the treatment group is randomly assigned, we think, oh, well, if we'd gone to the control group, we'd have gotten the same share in expectation. So we get that number, a, out of our data; or, in the case of Uganda, we know the share of people who show up to our films. We can count them, and that gives us the proportion of compliers. So what, then, is our mapping from model to data to get T?
Well, we have an empirical estimate of E1, the assigned treatment group's average outcome; we have an empirical estimate of E0, the assigned control group's average outcome; and we have an empirical estimate of a, the share of compliers. The ratio (E1 - E0)/a gives you T, and that's the instrumental variables estimator. All right, so it's like a barroom bet: could I do instrumental variables regression on a hand calculator in under 30 seconds? The answer is yes; if you give me those three numbers, you can often do it in your head. Just the difference, divided by the share of compliers, okay? And notice that because the Pc's and the Pn's dropped out in the subtraction, we can be agnostic about those things. So again, to a critic who says, "yes, but wait a minute, those who take your treatment could be very different from those who don't," you say: correct, and that's why I'm using this estimator, which only compares the randomly assigned groups. We never compare the people who actually show up to the films to the people who don't show up. That's the key thing, and that's why it's worth going through the math. I'm sorry to drag you through a little algebra, but fortunately it's only junior-high-school algebra, and the reason to do it is so you can see with your own eyes that there was never a comparison of the people who showed up to the people who didn't. [Audience question.] You've hit on a super important subject, a big subject, and it reveals the infirmities of this very simple design with just two groups. If you have gradations of partial compliance, and you want to estimate the marginal effect of more time spent watching the vignettes, you need to do something to vary the amount of time that people spend. Maybe you run extra advertisements, or you have a third group that is given more encouragement, so that on average they watch more vignettes. But what you've hit on is very important: this design gives you the effect of showing up at all, because we crucially assume in this setup that the never-takers are unaffected by the treatment. If we say, "oh well, one viewing doesn't do anything," now we've made a different assumption, whereas our assumption is that the people who don't show up cannot be affected. So you're absolutely right, and the answer to that is more nuanced experimental design, okay? Next item; these hat signs mean that these are estimates: "the estimated ITT always has the same sign as the estimated CACE." Remember, the CACE is what we just called T: the complier average causal effect, the average treatment effect among compliers. (I realize the slide comes in before we define the term.) So, thinking back to the setup: does a*T have to have the same sign as T? The critical thing is whether a has to be positive; that's basically the way to think about this question, with the answer options being yes, not sure, or no. The answer is that it is true, because a has to be positive: there have to be some compliers. If there were no compliers you'd have no experiment, but if you assume there are at least some compliers, whose causal effect you can estimate, then yes, it's true that these two things have to have the same sign.
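As a sketch of that barroom bet (a hypothetical helper function, not from the lecture):

```python
def cace(e1: float, e0: float, a: float) -> float:
    """Wald / instrumental variables estimator: the ITT (e1 - e0)
    divided by a, the share of compliers."""
    if not 0 < a <= 1:
        raise ValueError("there must be some compliers: a must be positive")
    return (e1 - e0) / a

# Made-up numbers: a 10-point ITT with half the assigned group actually
# treated implies a 20-point effect among compliers.
print(cace(e1=0.55, e0=0.45, a=0.5))  # 0.2
```

Note how the sign point falls out of the formula: since a > 0, (e1 - e0)/a always has the same sign as e1 - e0.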
And not only do the ITT and CACE share the same sign: the statistical significance of the two quantities is almost always very, very similar, so it's not as though you're doing something very different in terms of inference when you look at the causal effect among compliers versus the ITT; basically, the active ingredient is the ITT in both cases. Okay, back to terminology (next time maybe we'll put the question slide after this slide). The terminology footnote is that what we're calling T is the complier average causal effect; that's what it's typically called in the statistics literature, whereas in economics the CACE is almost always called the LATE, the local average treatment effect. It's not local in a geographic sense; it's local in the sense that it applies to the group of compliers but not to never-takers. Those two things could be totally different: the average treatment effect for the whole subject pool could be very different from the average treatment effect among compliers. Right, yes: they might show up because they're different; they're more interested, more open in terms of their personality; there could be lots of differences among them. But the two terms refer to exactly the same thing: the local average treatment effect is the effect among compliers, and the term LATE, although it persists in economics, is gradually waning, while the term from statistics is gradually taking root. Note that for the segment of the subject pool who are compliers, the CACE, their average treatment effect, is equal to the ITT, because they do what they're told: if they're assigned to the treatment, they take the treatment, so there's no disjuncture between their assignment and their actual treatment, and their intent-to-treat effect, the effect of assignment, is exactly the same as their average treatment effect. And the final note: among never-takers, the effect of assignment is zero, because no matter what you do, you can't get them to budge. You can assign never-takers all day and you'll never see a change in outcomes, because they never take the treatment. Okay, so now let's see how that plays out in this empirical example again. Assignment is at the trading-centre level, so we're going to have to cluster our standard errors when we analyze the data; there are 56 clusters. It turns out, incidentally, that the level of intracluster correlation in Ugandan villages is very, very low, which is to say these villages are quite homogeneous. It's as though the villagers, from the standpoint of these outcomes, were randomly sprinkled across villages; there isn't much sorting from village to village. Fortunately for us, by the way, we were able to know that ahead of time for planning purposes, because we did a pilot survey: before the big study, we surveyed villages to get the bugs out of our measurement devices, and we noticed the weak intracluster correlation.
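For the curious, here is one standard way to estimate an intracluster correlation from pilot data, the one-way ANOVA estimator with equal cluster sizes; the simulated numbers are mine, not the lecture's, but they mimic the weak-ICC situation described.

```python
import numpy as np

def icc(y: np.ndarray, cluster: np.ndarray) -> float:
    """One-way ANOVA estimate of the intracluster correlation,
    assuming equal cluster sizes."""
    ids = np.unique(cluster)
    k = len(ids)                       # number of clusters
    m = len(y) // k                    # respondents per cluster
    groups = [y[cluster == c] for c in ids]
    grand = y.mean()
    msb = m * sum((g.mean() - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

# Simulated pilot: 56 villages, 20 respondents each, tiny village-level variance.
rng = np.random.default_rng(1)
village = np.repeat(np.arange(56), 20)
y = rng.normal(0, 1, 56 * 20) + rng.normal(0, 0.1, 56)[village]
print(round(icc(y, village), 3))  # roughly 0.01: villagers barely sorted by village
```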
Okay, so in terms of notation: typically in these kinds of models the assignment is denoted Z and the treatment is denoted D; that's typical of the Harvard and MIT papers, and the mnemonic is D for "delivered," one if you're actually exposed to the treatment and zero otherwise. We define exposure very broadly for purposes of this analysis: you showed up, or a family member showed up. So basically the only people we're excluding are the people who were totally unaware of it and had no connection to it, directly or indirectly. And Y is just the willingness to help a girl who's ostracized on account of having an abortion. This is the question: suppose that a girl in your neighborhood has had a deliberate abortion (and we randomly vary whether she wanted to stay in school or wanted to take a full-time job as the pretext for why she had it), and she's been ostracized. Two of your friends make the following statements; with which friend do you most agree? "She made her choice and has violated God's rule; it's better not to get involved," or "Regardless of what this woman did, we should try to help her." Those are the translations of the question. So here are the outcomes, starting with the effect of assignment Z on treatment D. We can imagine a very simple path model where Z affects D, which affects Y: this is the assignment, this is the treatment, and this is the outcome. (And now you know why I don't write on the board very much.) So basically, the idea is that the only way assignment influences outcomes is through the treatment: your village is assigned to get a treatment, and the only way that can influence you is if you actually saw the film. Okay, so what this shows is that 68% of the people in villages assigned to the treatment videos actually saw them; and of course nobody saw them in a control village, because they couldn't. And now comes the ITT part. In order to get the ITT, you look at the average outcome in the assigned treatment group and the average outcome in the assigned control group and compare them: about 75.6 percent of the assigned treatment group said "we should help the girl," as opposed to 70.8 percent in the control group. That difference, in proportion terms, is 0.048. I wrote that number down in detail because I'm going to show you how you can get it through regression in just a minute, but it is the simple difference between those two numbers. So the abortion films seemed to increase empathy for these women and girls when people were interviewed eight weeks later (though there was no effect of the domestic violence films; this one did have an effect). So then how do we estimate the complier average causal effect? We take the difference between E1 and E0, that's our ITT, and we divide by a, the share of compliers, 0.68, to get 0.0705. What does that mean? It means that among compliers, that subset of people who would go to the film if and only if their village were assigned to the film, actual exposure to the treatment increased their expressions of empathy on the survey question by 7.1 percentage points.
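In code, that whole calculation is three lines; the inputs below are the numbers read off the slides (the small discrepancy with the reported 0.0705 is just rounding in the inputs):

```python
e1 = 0.756  # assigned treatment group: share saying "we should help the girl"
e0 = 0.708  # assigned control group
a = 0.68    # share of compliers: attendance in treatment villages

itt = e1 - e0       # 0.048, the intent-to-treat effect
cace_hat = itt / a  # about 0.0706: roughly 7.1 percentage points among compliers
print(round(itt, 3), round(cace_hat, 4))
```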
all right, so it's not night and day, but 7.1 percentage points is, interestingly, larger than the standard deviation across villages — so it's like moving people from an average village to one of the most supportive villages. that is equivalent to the so-called instrumental variables regression estimator, and I'll show that in a minute. and notice again how we got this: we never compared those who showed up to the film with those who didn't show up; we didn't even need to know who in particular showed up — that was all irrelevant. we just needed to know the rate. so how would you do this in Stata? you would regress Y on Z in order to get the intent-to-treat effect, but you'd have to cluster the standard errors at the trading-centre level, so there will be 56 clusters, and those give us our robust standard errors. the coefficient should look familiar: it's exactly what we got before just by taking the difference between the assigned treatment and assigned control means. all I wanted to demonstrate there is that there's absolutely no difference between reading a table and reading a regression — they're exactly the same. why run a regression, then? because it gives you a standard error, and the standard error gives you a p-value and a confidence interval. in this particular case you would conclude that there's about a three percent chance of obtaining a number as large as what we actually got, or larger, by chance if the true effect were zero: if the true ITT were zero, it would be pretty unlikely that we'd get an estimate this large. that's one interpretation. the other thing we can do is run an instrumental variables regression — Y on D, instrumenting with Z — to get the complier average causal effect. so it's an IV regression of Y where D is a function of Z; in other words, Z, the random assignment, predicts treatment, and because it's random it is by assumption unrelated to all the unobserved causes of the outcomes. that number is exactly what we got when we computed it by hand — the ITT divided by the share of compliers, 0.0705 again. the standard errors are clustered at the trading-centre level, and the p-value is very, very similar. that was about two weeks of lecturing right there in the last twenty-three minutes, so on to the next topic.
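concretely, the two commands just described would look roughly like this in Stata — a minimal sketch with hypothetical variable names (y the outcome, z the assignment, d treatment receipt, tc the trading centre):

```stata
* ITT: outcome on assignment, standard errors clustered by trading centre
regress y z, vce(cluster tc)

* CACE: treatment receipt instrumented by random assignment
ivregress 2sls y (d = z), vce(cluster tc)
```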
so that was a case where you address non-compliance by keeping track of what your share of compliance was — all you do is measure the share of compliers. another approach is to change the design: make it a placebo design. this is a case where you have an inactive placebo — it has no effect on outcomes — and it basically ensures that the same complier group that shows up to your treatment also shows up to your non-treatment. in our case we have a placebo-controlled design: people in the control villages show up to a film festival that is in every way identical across villages. they don't show up to see the commercial breaks; they show up to see the films. so they show up, and we know who the compliers are in the control group too. rather than compare the assigned treatment group to the assigned control group, we can compare the compliers in the assigned treatment group to the compliers in the assigned control group.

so that's the setup: we want to make sure nobody is tipped off to which condition they're in, and that the compliers are exactly the same in both arms. the advantage of doing this is that it can improve the precision with which we estimate the CACE — it's exactly the same estimand, but now we can estimate it with greater precision. so we measured who actually attended, who had friends who attended, and who did neither: the compliance types are revealed in both the treatment group and the control group. assuming symmetry — this is the critical thing; you've got to assume symmetry, which means doing everything exactly the same for the treatment and placebo groups. a way to check whether you've done it right is that there should be no effect among never-takers, except perhaps a spillover effect: those who don't show up should be very similar across arms. and indeed, when we look at never-takers, their results are almost identical in the treatment and control conditions — the only reason they'd differ is a spillover effect. these are the people in treatment villages who didn't show up and didn't have friends who showed up, versus the folks in treatment villages who did show up or had friends who did, and again we see about a seven-percentage-point effect. so it's another way of estimating the same quantity, the effect for compliers, and we can do it with regression again: we regress outcomes on treatment assignment, subsetting to the compliers — the people who actually show up. among compliers we get an effect of about 7.1 percentage points, but notice that now our p-value is much, much lower, because we can estimate the effect more precisely thanks to the placebo design. so if you have non-compliance, and a placebo design is feasible, it's very often quite an effective strategy. that's why we were careful to pair issues that were unrelated to one another — teacher absenteeism versus abortion, for example — because we knew we could match them together. so one design suggestion: if you're working with other NGOs, pool your RCTs together and have one RCT's treatment be the other RCT's placebo.
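in code, the placebo-design estimate is just the ITT regression restricted to revealed compliers — a sketch, again with assumed variable names (complier marking attendance, which the placebo design reveals in both arms):

```stata
* effect among compliers: treatment vs. placebo assignment, compliers only
regress y z if complier == 1, vce(cluster tc)
```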
to summarize: failure to treat changes what you can estimate. you can only 'guesstimate' the average treatment effect for the whole subject pool; what you can go after instead is the complier average causal effect. remember, the compliers are those who take the treatment if and only if they are assigned to the treatment group. beware of extrapolation, because the CACE could be very different from the average treatment effect: if you're doing a job training program, the people who show up voluntarily when invited could have very different treatment effects on average than everybody would if they were forced to show up, so extrapolation is potentially dicey. another interesting quantity is the average effect of assignment — the intent-to-treat effect — because that tells you something about the effectiveness of the overall program. okay, here comes another J-PAL quiz item: the estimated CACE and the estimated ITT will be the same when a) there's perfect compliance, b) the ATE is zero, c) amongst compliers, or d) answers A and C. we're all over the place. all right, well, I would say it is D. why is it not B? what's wrong with B?

the thing is that the ATE refers to the subject pool as a whole, right? suppose the treatment effect were positive for compliers and negative for never-takers; the ATE could be zero, and yet the ITT would be positive. and in general the ITT is weighted down by the never-takers, who have an ITT of zero, so the two quantities won't necessarily be the same. but they are the same under perfect compliance, because then assignment and treatment are the same thing, and they are the same among compliers, because compliers do what they're told. so the correct answer is A and C. before we leave the topic, remember the assumptions behind the estimation of the CACE — in particular, that there's no back-door path from assignment to outcomes. a classic case of a back-door-path violation comes from restorative justice. restorative justice, to make it very brief, is the idea that victims will feel better if the perpetrator of a crime against them apologizes. an experiment by Sherman and Strang sought to test that hypothesis by having real, convicted criminals apologize to their victims: some victims were assigned to be apologized to and others not. in the very early rounds of the study they found that sometimes the perpetrator would show up for the meeting and then not actually issue an apology. those people were treated, but not treated as intended, so you couldn't say there was no effect of assignment on them even though they didn't get the treatment they were expecting. that's a case where you can't just ask what share of the treatment group actually got an apology, because subjects could be influenced by something other than the treatment. in subsequent experiments they tightened that up to prevent the problem: they pre-screened the perpetrators to make sure they were ready to apologize. so that's one issue. you also have to define the treatment and measure whether subjects receive it — this is one of the key things if you're running an experiment that's going to encounter non-compliance: measure what proportion of people actually got your treatment. to save time, we've only talked about one-sided non-compliance, but chapter 6 of the textbook deals with two-sided non-compliance, where some people get treated inadvertently; that's more characteristic of encouragement designs, and I've put those slides in the back in case you're interested. I'm going to skip the review because we're short on time, and go on to the next, even more depressing topic: attrition. attrition is probably the most depressing topic because it can upend even an extraordinarily elegant and lavish study. perhaps the most famous and contentious example is the RAND health insurance experiment, conducted in the 1970s by a team of researchers at RAND and at Harvard. the study cost, in real terms, more than 300 million dollars. it was an extraordinarily lavish study that gave enrollees different regimes of health insurance: either full 100% coverage, or some gradation going all the way down to 5%.
well, not surprisingly, the people who got 5% coverage were much more likely to drop out of the study. the finding, at the end of the day, was that there was no difference in measured health outcomes between the group that stayed in the study with full 100% coverage and the group with 5% coverage. but what if the sickest people in the 5-percent arm said, 'no way am I staying in this study — I could get sick,' and left? if the sick people drop out of the low-coverage arm, that distorts the comparison between high and low coverage, and in effect those people have un-randomized themselves. so sharp differences in attrition rates between treatment and control are a threat to the core assumption of random assignment. that's the key issue: does attrition threaten our ability to estimate the average treatment effect, given that we only observe outcomes for the people who report them when asked? so one of the things to think about, if you encounter attrition, is whether it threatens symmetry, and there are two things you can check immediately. first, are the rates of attrition the same in treatment and control? if they're different, as in the case of the school voucher study, that raises a red flag. it doesn't necessarily mean there's bias — in the end you never know, because you know nothing about the people who slipped away unmeasured — but it certainly suggests something is going on. second, and even more problematic, is whether covariates predict missingness differentially in treatment and control. if pretest scores have a different relationship to missingness in the two arms, that suggests people are dropping out in a way that's likely related to potential outcomes in one of the groups — which is a total mess, because again it's a symptom of people un-randomizing themselves. so be very, very careful if either of those symptoms appears in your dataset. attrition is a problem so debilitating that it's important to think about it before doing any further design work: think about how you're going to measure outcomes first. okay, so what kinds of things cause attrition? sometimes it's the experimenter. it's not uncommon, for example, for psychologists running lab experiments to debrief the subjects in the treatment group, ask them whether they sniffed out the hypothesis, and discard from the analysis those who say yes. that breaks the symmetry between treatment and control, because you're not doing anything similar in the control group — you're not giving control subjects the treatment afterwards and checking whether they would have sniffed it out; you're not doing anything with them — so you're breaking symmetry almost by design. it's also quite common in studies of donations or purchasing to throw out the people who don't purchase anything or don't make a donation, and just focus on those who gave something. that too is potentially biased: what if the treatment affects whether you donate? if it does, then the average amount donated among those who donate something could be quite misleading.
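the two symmetry checks just mentioned are easy to run. a sketch in Stata, with hypothetical names (y the outcome, z assignment, pretest a baseline covariate, tc the cluster):

```stata
* check 1: do attrition rates differ by assignment?
generate attrited = missing(y)
regress attrited z, vce(cluster tc)

* check 2: does a covariate predict missingness differently in the two arms?
* (a significant interaction term is the warning sign)
regress attrited i.z##c.pretest, vce(cluster tc)
```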
so don't drop the non-donors. on the other hand, maybe this is okay: if sections of your dataset go missing for seemingly unrelated administrative reasons, that may not be a problem, provided those reasons are not plausibly connected to treatment assignment in any way. so what can you do to rescue an experiment that has suffered possibly debilitating attrition? one approach that is not used often enough — though it has been used from time to time, including, I believe, in the Moving to Opportunity study, and we've been using it more in our lab recently — is double sampling. this means drawing a random sample of the missing in both treatment and control: rather than trying to get everybody, you make an intensive effort to reach some of the missing, to figure out whether they are very different from the people who furnished answers in the first round. if you use your resources intensively and get a random sample of the missing, you can use that group to fill in the missing pieces of the treatment and control groups that opened up when people dropped out. I like that approach very much; it's not done often enough, and it works well in the developing world, where intensive effort really can get people to provide answers. another approach, which is often very depressing because it tends to produce a huge amount of uncertainty, is worst-case bounds — extreme-value bounds — which use the highest and lowest possible outcomes. you give the highest possible outcome to the missing in the control group and the lowest possible outcome to the missing in the treatment group to get the lowest possible effect, and do the reverse to get the highest possible effect. the problem is that with any appreciable level of attrition — say 25 or 30 percent or more — the bounds spread out to a range that is, from a policy standpoint, pretty useless. another approach, also discussed in chapter 7, is so-called trimming bounds. I won't go into detail, but suffice it to say that if you have differential attrition in treatment and control, you can sort the group that has the extra subjects from highest to lowest and lop off the extras at the top of the distribution — or reverse-sort and lop off the bottom — to see what would have happened had those people not responded. that kind of trimming estimates the average treatment effect among the subset of people who would always report regardless of their treatment assignment. it's often less depressing than extreme-value bounds, but it targets a different quantity: extreme-value bounds are about the average treatment effect for the whole pool, whereas trimming focuses on the people who would always report regardless of their condition. but beware of simply dropping blocks of observations because of missingness — that practice pops up, and it's prone to bias.
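to make the extreme-value bounds concrete — a minimal sketch, assuming a binary outcome y (coded 0/1) that is missing for attriters, assignment z, and cluster tc:

```stata
* lower bound: worst case (0) for missing treated, best case (1) for missing control
generate y_lo = y
replace  y_lo = 0 if missing(y) & z == 1
replace  y_lo = 1 if missing(y) & z == 0
regress y_lo z, vce(cluster tc)

* upper bound: reverse the imputations
generate y_hi = y
replace  y_hi = 1 if missing(y) & z == 1
replace  y_hi = 0 if missing(y) & z == 0
regress y_hi z, vce(cluster tc)
```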
one last thing before leaving the subject of attrition: missingness on the outcome side is very different from missingness among covariates — among control variables. if you have missingness in a covariate, you should not drop those observations; figure out a way to impute something, because covariates are just gravy — they're not required for unbiased causal inference. you can keep those subjects in the analysis just by imputing some kind of value: for example, in a regression you can include a dummy variable for whether somebody is missing on that covariate and plug some arbitrary value into the variable itself. but don't throw out observations because they're missing covariate values.

I'll skip my review — only a few minutes left. last problem: interference. the J-PAL animation basically asks, what if the treatment leaks into the control group? but really it could be any kind of interference between units; it doesn't have to go from treatment to control — it could go from control to treatment. the key question to ask is: are the subjects' potential outcomes a reflection only of whether they personally receive the treatment, or could subjects be affected by whether others around them receive it? if the latter, you have a potential non-interference violation. where does that arise in social science? well, we just mentioned vaccination and contagion: anything contagious could allow a treatment to spread from one person to the next. and anything that can be displaced. for example, I'm currently collaborating with Chris Blattman and a team of researchers on a crime experiment in Bogotá, and that's a classic case of displacement, because it's hot-spots policing: the police have identified lots of places as high-crime areas, and they want to allocate more police resources to surveillance of those places. but what if they're pushing the criminals away from the hot spots into adjoining areas? now crime goes down in the treatment area and up in the control area. we're no longer comparing the treated potential outcomes to the untreated potential outcomes; we're comparing the treated potential outcomes to the spillover outcomes, and we could exaggerate our treatment effect. the same is true if you take police resources away from one group and give them to another: you're no longer comparing treated to untreated, you're comparing them to something that is, in a sense, negatively treated. the same goes for communication, which basically works the same way as contagion, and for social comparisons — I mentioned the case of people who lose lotteries becoming jealous of those who win. and in a within-subject design, where the same subject is tracked over time, one threat to the non-interference assumption is anticipation of the treatment that's about to arrive, or lingering effects and memories of the treatment just received — then there's no sharp break between treatment and control periods. those are all ways in which spillovers occur. the biggest complication I routinely see when people analyze data involving spillovers is that they forget that equal-probability random assignment of units does not imply equal probability of exposure to spillovers. imagine villages on a road — here's the road, and here are villages A, B, C, D, and E — and suppose we're going to randomly assign one of those villages to get a health clinic. and suppose we have three kinds of potential outcomes.
you're directly treated, you're adjacent to a clinic, or you're neither directly treated nor adjacent. random assignment of one village gives all of them the same probability of getting the clinic: it could be here, there, anywhere. but adjacency? those probabilities differ from place to place: the villages on the ends are less likely to be adjacent than the ones in the middle. and that's the big thing people forget when analyzing data involving spatial spillovers: the way the villages — or whatever the units are — are configured spatially has a big effect on whether they can be exposed to spillovers. so you have to be careful about comparing unweighted means; you have to reweight the data by the inverse of the probability of exposure to each condition, which I'll demonstrate in just a minute. one of my own studies — I won't claim it's a very profound study, but it is incredibly difficult to do; I've done it four times because it's so hard to pin down the answer — is about lawn signs. it's a good example of a communication experiment with general applicability: suppose you want to know the effect of putting signage in some places as opposed to others. in my world, which is politics, this is done all the time — in any election, all kinds of signs pop up on people's lawns and along roadways — and you want to know whether that actually generates any votes. the idea is to take precincts — we've done it in congressional districts, we've done it in gubernatorial races — and randomly assign them to get 40 or 50 lawn signs planted by the roadways, so you have treatment precincts and control precincts. the problem is, what if a person drives past a bunch of signs on a major thoroughfare and then goes to vote in some nearby precinct? now your treatment has, in some sense, traveled from treatment to control. to model that — somewhat imperfectly — we created a setup with three different potential outcomes: directly treated, adjacent to treated, and neither directly treated nor adjacent. if a precinct didn't share at least a half-mile border with something treated, it was considered untreated. so that's the setup. you could imagine a more complicated set of potential outcomes — what if you're adjacent to adjacent, or adjacent to adjacent to adjacent? there's almost no end to the amount of complexity you can bring to bear, but of course the more complex your setup, the harder it is to estimate.
so what we did is run our random assignment, and then we needed to figure out the probability of exposure to each of these conditions. we repeated the random assignment a very large number of times and recorded the share of simulations in which each precinct was directly treated, adjacent to treatment, or not treated at all, and we made heat maps of those probabilities. then we weighted each observation by the inverse of the probability of the condition it actually landed in: if you're treated, you're weighted by the inverse of your probability of being treated; if you're adjacent, by the inverse of your probability of adjacency; and the same for control. then we ran a weighted regression with two different outcomes — the congressional vote margin (how many votes did you win by) and the congressional vote share — controlling for covariates, namely past voting outcomes. this was a congressional district, and precincts have very, very stable voting patterns in congressional general elections, so the r-squared jumps from 0.03 without covariates to 0.82 with covariates, which brings our standard errors down quite a bit. but it's still really murky: it looks like signs in your precinct raise your vote share by about two and a half percentage points, but with a 2.0 standard error it could be nothing or it could be something big. also murky, though slightly less so, is the spillover effect, which looks positive — 1.8 percentage points — but with a 1.8-percentage-point standard error. so we had to do the experiment again, and again, and again, until we finally accumulated enough evidence to have something to say. but that's the approach to spillovers: it's not just throwing in a dummy variable for 'are you near a treated unit,' because if you do that you're prone to bias — you're conflating the units' spatial positions with the treatment itself. so keep your eye on the ball when it comes to spillovers. unless you specifically aim to study spillovers or displacement, you probably want to design your study to minimize interference. we didn't want interference across villages in Uganda, so we arranged for our villages to have at least a 5-kilometer buffer from one village to the next. that didn't necessarily prevent people from going from one village to another, but it made it hard enough that our treatments weren't bleeding over. but if you do seek to estimate spillovers, remember that you may have to do something fancier than a simple comparison of means: you may have to reweight the data, or do something akin to what we did in Uganda — measure different gradations of compliance and use a placebo-control design, so that you compare attenders with attenders, friends of attenders with friends of attenders, and those who neither attended nor had friends who did, across treatment and placebo.
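here is a rough sketch of that reweighting procedure in Stata. everything in it is hypothetical: the variable names, and a deliberately simplified randomization (one unit along a road gets treated; 'adjacent' means next along the road) standing in for the real assignment scheme:

```stata
* one row per precinct, with roadpos giving position along the road;
* approximate exposure probabilities by re-running the assignment many times
set seed 20180101
local sims = 10000
generate p_treat = 0
generate p_adj   = 0
forvalues s = 1/`sims' {
    generate u = runiform()
    sort u
    generate sim_treat = (_n == 1)      // one unit treated at random
    sort roadpos                        // restore spatial order
    generate sim_adj = (sim_treat == 0) & ///
        (sim_treat[_n-1] == 1 | sim_treat[_n+1] == 1)
    replace p_treat = p_treat + sim_treat / `sims'
    replace p_adj   = p_adj   + sim_adj   / `sims'
    drop u sim_treat sim_adj
}

* weight each unit by the inverse probability of the condition it landed in,
* then run the weighted regression with a pre-treatment covariate
generate w = 1 / p_treat if treated == 1
replace  w = 1 / p_adj   if treated == 0 & adjacent == 1
replace  w = 1 / (1 - p_treat - p_adj) if treated == 0 & adjacent == 0
regress voteshare i.treated i.adjacent lag_voteshare [pweight = w]
```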
okay, last topic: drawing generalizations. earlier I mentioned that there are four considerations regarding generalizability — subjects, treatments, contexts, and outcomes — and notice that each of these plays a role in the choice of experimental design. for example, consider the choice between a field experiment, a survey experiment, and a lab experiment. there are many quite telling lab experiments, and sometimes they're even done in the field — researchers go to some far-flung place and run a behavioral game there, trying to eliminate the concern that the subjects are not the subjects of interest — but quite often lab experiments are done with undergraduate subject pools, usually associated with psychology classes. how much that matters is a matter of contentious debate. the treatments: very often the treatments being evaluated are the ones deployed by actual governments or actual NGOs, but sometimes they are treatments contrived by academic researchers, and again it's a matter of what you want to know. do you want to know what would happen if an academic researcher designed and implemented something, or do you want to know how things tend to operate out in the wild? with respect to context, I'd say unobtrusiveness is a big part of the choice among designs. if people are in a study and know they're part of a study, that can affect the way they think about the treatment — how they reflect on it, how they absorb it — whereas if the treatment, or the outcome measurement, is more unobtrusive, there may be less threat of bias from socially desirable answers, or from behaviors (especially in survey outcomes) that reflect a lack of symmetry between treatment and control. for those of us who study the media, the main concern is very often the context: it's very difficult to get people to watch TV in a naturalistic environment. it's true that you can have MTurk subjects watch your ads, but they're watching as paid workers — not necessarily the way an ordinary person would watch an ad — so again the context may limit generalizability. and finally, outcomes. not invariably, but typically, lab outcomes are measured right then and there: you do your intervention, you measure your outcome before people slip away, and rarely are people measured a day or a week later. Betsy Levy Paluck at Princeton and I are updating our review of the literature on prejudice reduction, and if you take the more than 900 studies of prejudice reduction and winnow them down to the ones randomized in a field setting — that's a bit more like real treatments in real contexts — and then winnow further to the ones that measure outcomes at least one day after treatment, you get down to fewer than 30. this is a huge hurdle for lab-like studies from a policy standpoint: you want to know, do the effects endure at least one day?
and presumably you'd only want to take action based on experiments if you had some confidence that the effects endured at least that long. so anyway, those are the considerations. I would say your immediate aim when you do a study is to estimate the effect in your subject pool, but eventually you want to replicate the study using different subjects, treatments, contexts, and outcomes, to explore sources of heterogeneity systematically. this is why the first thing a true experimenter says after completing an experiment is 'let's do another experiment!' — you want to know how it would have come out had you done things somewhat differently. and increasingly, as field experimentation takes root, you see collaborations across very different contexts: once you've done an experiment a few times, the next task is to do it in parallel ways in different settings. which brings me to my next point: if you want to compare effect sizes rigorously over time or across different kinds of treatments, you should be drawing again and again from the same subject pool. one way to do that is to have a giant control group and a relatively small treatment group; you can keep dipping randomly into the control group and treating those draws, because in expectation they are the same as the treatment group. so if you have the notion that your treatments are diminishing in effectiveness over time — say, as people become more accustomed to a technology — a large control group to dip into lets you test that rigorously. increasingly, I'm also a fan of having the same treatment serve as a benchmark in study after study, so you can calibrate the size of your effects against something known. I think this is gradually taking root — less so in my area of campaign experiments, but in other areas, for example job discrimination experiments, it's becoming increasingly common to have calibrated effect sizes against which other effect sizes can be compared: how does, say, the racial discrimination effect compare to the gender discrimination effect, or to the effects of other kinds of qualifications? all right, let me stop there. I have more slides — we always have so many slides — but in the interest of not overdoing it, let me open it up for comments and questions, and I believe there will be discussion-group exercises after that. so let me pause and stop yapping. the first question from the audience: any specific recommendations — things to keep in mind when working with administrative data, knowing that it may help overcome some attrition challenges? what are the key points? it's a great question. so the basic question is what special considerations go into using administrative data for outcomes. I would say, first of all, do a pilot study to see that you can actually get the administrative outcomes when you request them. another thing is to be aware that some administrative outcomes are extremely sensitive, especially if, for example, a government gets wind of the implications of what you're doing. a famous instance of this is the Collier and Vicente study of voter intimidation, I think in Nigeria.
they were going to use actual vote outcomes, until Nigerian election officials said, 'no way are you getting those outcomes.' and Susan Hyde, in her first election-monitoring experiment in Indonesia — she scraped the polling results that night, and they were gone a day later. so beware: sensitive administrative data have to be captured basically instantly, and if you can't get them reliably from a government agency, you might want to dispatch your own enumerators to collect the data on the spot. for non-sensitive topics, it's mostly a matter of doing a dress rehearsal to make sure the data will actually be produced. I've been burned a few times by not doing a dress rehearsal: I assumed the data would reliably be produced at the same level as in the past, but people change, offices change, procedures change, and suddenly the data are grouped in a different way — not the groupings I had used. so when working with administrative data, find out what groupings will be used when the data are presented, and randomize at that grouping, so there's no mismatch between your unit of assignment and the unit of data collection. great question. other questions, please. the next question is an open one — the questioner is happy with just a pointer to a reference: how do you think about placebo groups when designing treatments and controls, given that they can be a large drawback in social science RCTs? okay, so the basic question is, what are the pros and cons of placebos in social science experiments? on the pro side, if you encounter severe non-compliance, placebos have the potential to dramatically improve the precision with which you estimate the complier average causal effect. the downside is that they're risky: they have to be implemented symmetrically, and if there's any violation of symmetry you've wasted your resources — you're no better off, because you don't even know how to interpret the treatment-versus-placebo comparison once asymmetry has thrown a wrench in the works. more generally, though, I think having a variety of placebos, even when you're not experiencing non-compliance, is a way to get at causal mechanisms that is much more reliable and plausible than the usual mediation analysis proffered in social science. rather than the loopy regression-type analyses that are often done purportedly to study mediation, use different conditions that add and subtract ingredients — effectively placebos, if you like — to isolate the active ingredients in the treatment of most interest to you. for example, if you're interested in conditional cash transfers: is it the cash, the conditionality, or the combination of the two? those aren't literally placebos, but they're instances where you neutralize some aspect of the treatment in the hope of identifying the active ingredient. how does that link to the use of placebos in biomedical research?
well, typically they want to know: is it the act of taking a drug, or the drug itself? so they have everybody take something — also to maintain blindness on the experimenter's side, to make sure the experimenters aren't putting conscious or unconscious bias into the analysis of the results. for all those reasons placebos are helpful, but they are potentially wasteful: if you're expending a lot of resources to give somebody a placebo treatment, you'd better have a good reason for it. where the stakes are very high, biomedical researchers do it — in certain surgical RCTs they literally perform sham surgeries on the placebo group. that's a high price to pay, but the stakes are extraordinarily high, so in some sense the use of placebos hinges on the stakes. next question: is two-sided non-compliance handled by a similar generalization of the instrumental variables approach, and are there materials you'd recommend? the group's project is leaning toward an encouragement design — a key issue is that they can't turn away members of the control group from an educational intervention — so the two-sided non-compliance issue would probably come up. it's a great question, and anticipating it, here are the extra slides I have on exactly that. for the reasons you mention, with encouragement designs you very often encounter two-sided non-compliance. the math is very similar to what I showed, but instead of two latent groups there are now four. the extra groups are the always-takers — people who take your treatment regardless of assignment, who want it so badly they'll get it on their own if you don't give it to them — and the defiers, those perverse types who take your treatment only if assigned to the control group. typically you assume there are no defiers, and that assumption has to be interrogated case by case; chapter 6 of the textbook gives lots of examples of where it is more or less plausible. but suppose you can rule out defiers. then the estimation goes exactly as before: you compare the estimated mean outcomes in the assigned treatment and assigned control groups, and that difference — the intent-to-treat effect — is the numerator. you divide it by the estimated share of compliers. how do you find that? the difference in take-up rates between the treatment and control groups reveals the share of compliers. why? because in the treatment group, the people who take the treatment are not only the compliers but also the always-takers, whereas in the control group, assuming no defiers, the only people who take the treatment are the always-takers. so comparing the shares who take the treatment in the assigned treatment and assigned control groups reveals the share of compliers, and that goes in the denominator. it's exactly the same as — whoops, I was pointing at the earlier slide where the instrumental variables regression was, but the Stata commands are exactly the same.
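for reference, the two-sided case runs with the same commands as before — a sketch under the no-defiers assumption, with hypothetical names (y outcome, z encouragement, d take-up, unit the cluster):

```stata
regress d z, vce(cluster unit)               // first stage: difference in
                                             // take-up rates = complier share
regress y z, vce(cluster unit)               // ITT (reduced form)
ivregress 2sls y (d = z), vce(cluster unit)  // CACE = ITT / complier share
```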
running through the slide: it's the same thing I just said, but now with our two take-up rates. exactly as before, we subtract one from the other; that difference in take-up is the denominator, the ITT is the numerator, and to estimate the CACE we divide the ITT by that share of compliers — the difference in treatment rates between treatment and control. here's an example: in the study by Mullainathan et al., they looked at the effects of viewing a mayoral debate in New York City. there was a 37% take-up rate in the treatment group, which was encouraged to watch the debate, and 16% in the control group — so a bit more than a third watched when encouraged, and about a sixth watched spontaneously. the ITT here is the difference in the proportion who reported a change in their opinions, and you divide that difference by the difference in take-up rates to get the CACE, which was done on the slide (part of it got cut off there). other questions? by the way, let me say a few more things about that: the power of a study that encounters any kind of non-compliance hinges critically on the relative take-up rates in treatment and control. I've seen experiments — kind of hilarious experiments — where an encouragement design basically encourages nobody, so you're effectively dividing by zero: no compliers. everything hinges on a successful implementation of the encouragement. next question (partly inaudible), describing a methodology seen in clinical trials in which missing outcomes in the treatment arm are imputed as treatment failures — is that a variation of the second methodology presented here? so there they're imputing the treatment you actually received, or imputing what your outcomes would have been. I would be very uncomfortable with that, but it's a matter of full disclosure: if you say, 'warning: my results hinge on these ad hoc assumptions,' then let the buyer beware. what I don't want is to see it buried in a footnote, or to find that some mysterious and unreplicated method was used; if it's out there in bold letters, it's at least somewhat more palatable. but frankly, I'd be very uncomfortable with any RCT that made strong assumptions about missing outcomes, because that undercuts the whole point of doing the RCT. the whole point was to convince a determined skeptic, and now the determined skeptic says, 'I'm more determined than ever, because you're telling me something no better than what I believed before.' that's why, in that case, I'd be more inclined to do a double-sampling approach: take a random sample of the people who went missing and really go after them. that could have been done in the RAND health insurance experiment — they could have gone back to administrative records and looked at death rates in treatment and control, but they didn't. the questioner follows up: but the point there is just to prove some level of superiority to placebo, and it's the most conservative approach, because everybody missing in the treatment arm is assumed to be a failure, right?
right — that's like extreme-value bounds. you can certainly do that, and if all they're doing is imputing extreme values, it inherits all the good points and bad points of extreme-value bounds: they're being as cautious as they possibly can, no problem there. but I understood you to say that they would also be doing things like carrying the last observation forward. imputing worst cases is okay, but it typically leaves you awash in uncertainty — so if you can still show something after doing that, you probably didn't have much of an attrition problem to begin with; it was probably relatively minimal. whereas in my neck of the woods — not Uganda; Uganda has a 99 percent response rate — in US studies, when we run campaign experiments and try to follow up with surveys, we tend to have quite a serious attrition problem, and extreme-value bounds, when we compute them, cross zero for anything that would be plausible as an effect. let's take one more question, if there is one, and then wrap up. well, it's not your last chance: if you have questions you can always email me, and I'm happy to talk about your field experiment projects — it's a kind of cradle-to-grave relationship; you can call me anytime. it's been a pleasure. I'm happy to meet informally with people who have questions before I take off, and good luck with your projects.
