The Future of Empirical Research in the Age of AI
47 min
Feb 6, 2026
Summary
Political scientists explore how AI tools like Claude can automate empirical research tasks, including data collection, code writing, and paper replication. The discussion examines both the productivity gains and significant limitations of AI in academic research, with a focus on the tension between automation and human oversight in maintaining research quality.
Insights
- AI coding agents excel at routine tasks (data collection, regression analysis, code translation) but fail at conceptual reasoning, missing crucial contextual changes like policy shifts that redefine research variables
- The productivity shock from AI will likely bifurcate academia: elite departments maintaining rigorous standards through human auditing, while lower-tier institutions rely on journal prestige metrics that become increasingly unreliable
- Human auditing of AI-generated research is currently faster than manual replication but slower than unvetted AI output, creating a productivity paradox where quality assurance negates efficiency gains
- AI simultaneously empowers sophisticated researchers while lowering barriers for low-quality work, risking a race-to-the-bottom where p-hacking and specification searching become effortless and undetectable
- The journal system faces existential pressure: if AI enables 5x submission volume but referees remain constant, quality gatekeeping becomes mathematically impossible regardless of reviewer skill
Trends
- AI-driven research reproducibility verification becoming a behavioral norm among serious researchers before journal submission
- Shift from static published papers to continuously updated 'living research' that ingests new data automatically
- Emergence of AI-powered discovery of natural experiments and identifying variation across legal/institutional changes at scale
- Bifurcation of academic institutions into high-scrutiny departments and prestige-dependent departments with declining quality control
- Migration of applied social science work from academia to private sector and foundations where feedback loops enable faster convergence on correct answers
- Increased risk of undetectable p-hacking and specification searching as AI makes generating plausible-looking false results trivial
- Growing skills gap: AI complements sophisticated researchers but may disincentivize PhD students from learning foundational research design skills
- Potential collapse of journal-based certification mechanisms in favor of departmental reputation and direct institutional auditing
- Expansion of software prototyping and tool-building as a new research methodology enabled by AI-assisted development
- Demand for new peer review and publication standards that account for AI-generated research and multiple testing at scale
Topics
- AI-assisted empirical research replication and extension
- Claude and Claude Code capabilities in social science workflows
- Automated data collection and regression analysis
- Research reproducibility verification using AI
- P-hacking and specification searching risks with AI
- Journal system sustainability under AI productivity shocks
- Academic hiring and reputation mechanisms in the AI era
- PhD curriculum design for AI-native research skills
- Human auditing of AI-generated research outputs
- Natural experiment discovery and identifying variation
- Living research and continuous data updating
- Vote-by-mail policy effects and electoral research
- Difference-in-differences methodology and causal inference
- Research ethics and responsible AI use in academia
- Institutional change and policy analysis automation
People
Andy Hall
Political scientist who used Claude to automate replication and extension of a vote-by-mail research paper with co-author Dan Thompson
Graham Strauss
UCLA PhD candidate who conducted a human audit of Claude's research replication to verify accuracy and identify AI limitations
Dan Thompson
Co-author of original vote-by-mail paper that was replicated and extended using AI tools
Ethan Bueno de Mesquita
Podcast co-host and political scientist who conducted interviews with researchers using AI for empirical work
Wioletta Dziuda
Podcast co-host discussing AI applications in academic research and personal experiences with AI tools
Anthony Fowler
Podcast co-host expressing skepticism about AI research quality and journal system implications
Alan Gerber
Co-author of earlier vote-by-mail research in Washington state that informed the replicated study
Seth Hill
Co-author of earlier vote-by-mail research in Washington state that informed the replicated study
Quotes
"It's an interesting question. I think if you did it again and said, we're replicating this, but somehow clued like the treatment we're interested in is vote by mail. The way we're measuring it is the VCA, as opposed to saying the treatment we're interested in is the VCA. I do think it's a perfect example of why this is not a tool that can go off on its own."
Andy Hall•Mid-episode
"I do think it could kill journals. I do think journals are going to have a hard time fulfilling their role as certification mechanisms."
Ethan Bueno de Mesquita•Late-episode
"This technology is a massive compliment to being high skilled at thinking conceptually as a social scientist, thinking in a sophisticated way about research design and econometrics, et cetera. And that you can do faster, better work with this tool if you're that kind of person. But it also is a mimic of that in a way that's very hard to detect."
Wioletta Dziuda•Late-episode
"You could just say here's the conclusion i want Here's the statistical significance level. Go find me some. Right. And so I think like we're in this bad arms race between referees and slop generating authors that massively favors the slop generating authors."
Anthony Fowler•Late-episode
"You have to have some really top-notch people who understand what the really good research is playing around with the new technology if you're ever going to find out whether the new technology is useful for the top-notch research."
Ethan Bueno de Mesquita•Final segment
Full Transcript
I'm Ethan Bueno de Mesquita. I'm Wioletta Dziuda. I'm Anthony Fowler, and this is Not Another Silicon Valley AI and Tech Podcast. So normally we talk about political science in this podcast, and I think we'll do that a little bit today. We'll be talking about some political science research and also some broader implications for how political scientists do their research. But we're also talking a bit about AI and tech today, because there's at least some fear, or excitement, that AI is going to drastically change the world. And there's no reason to think that political science would be exempt from that. Wiola, what are your thoughts on that? Are you excited about our AI revolution? Has it changed any way that you do your work? I'm super excited, but every time I get very excited, I get extremely disappointed. I have this theory that it's actually a super, super powerful tool who doesn't like me. I told him I'm going to practice my lecture, and I asked him to just create notes based on that. And he gave me very detailed instructions on what I was supposed to do. I was supposed to record myself on my iPhone and give him, it's a he in my world, the recording. So he told me to upload the MP3 file, and then he told me, great, I'm going to listen to that and create notes for you. And then, you know, a few hours later, I'm asking what's happening, and he's like, oh, I'm still listening, I want to be extra careful. And then it went on for a few days. You know, I kept coming back to the office, and eventually I asked him, are you still listening? And he's like, what are you talking about? I don't know how to listen. It never recovered. But I hope it will recover after today's episode, because, Ethan, you talked to someone who actually has a great relationship with Claude or other LLMs. Tell us about it. I did. I talked to two political scientists. I talked to a longtime friend of the podcast, Andy Hall, as well as Graham Strauss, who's an advanced PhD student in the UCLA political science department. Andy used a combination of Claude, the chatbot, and Claude Code, the coding agent, to automate replicating and extending a paper of Andy and Dan Thompson's from back in the days of the pandemic on vote by mail. And then Graham did a human reanalysis and extension to audit how Claude and Claude Code together did in this kind of task of replication, reanalysis, and extension, so that they could see how far this AI technology has come in being able to do certain parts of social science research. So let's give it a listen. So Andy, maybe you should actually start all the way back with what the original paper is and then what you asked Claude to do. So pretty early in the pandemic, it was already becoming apparent that voting in person in November of 2020 could be questionable. People might be worried about it. And so vote by mail became a very prominent policy debate. And it became even more prominent because President Trump was very skeptical of vote by mail. And so you had a bunch of people, especially on the left, who were particularly, let's say, COVID-concerned, really pushing to maximize vote by mail. And in particular, whether it would advantage one party or the other became a very live issue. And so we decided that was somewhere where we could contribute something to the COVID policy arena in a way that was really core to our research. And it was actually Dan Thompson, who Graham works with, who had the idea for the paper, which was, well, everyone's talking about vote by mail.
Did you know there's actually a group of states where this has been rolled out progressively over time? And there had been an earlier paper by Alan Gerber and Seth Hill and a few other people who had looked at this earlier in the state of Washington. And we were able to take advantage of time having gone by to do an update of that. And so we did a very simple paper where, in California, Washington, and Utah, different counties implemented universal vote by mail at different times. And we used that staggered rollout to estimate the effect that that policy of sending everyone in the county a ballot had on both overall turnout and the partisan voting outcomes. And what we found was it has a meaningful but small effect on turnout. When you mail everyone a ballot, that's a pretty dramatic intervention; it makes voting quite different. It does increase turnout, though not hugely. And it had what looks like a relatively close-to-zero effect on partisan vote shares. There is a positive point estimate for the Democratic Party. If you look at, let's say, the upper bound of the 95% confidence interval of the specification we think is most credible, you couldn't rule out some kind of advantage for Democrats, but it was not nearly as large. I mean, President Trump was claiming it, like, went from them losing every election to them winning every election. You could certainly rule out effects of that magnitude. Time went by. And as so often happens in empirical research, no one in political science perceived it as being worth their time to just say, hey, there was this paper in 2020. It's 2025 now. Let's collect the data and run the regressions again. Okay. Awesome. And so, being the tech-forward digital native that you are, you thought maybe our new AI overlords presented a potential solution to this problem, right? I've been interested for a long time in ways to improve the way we do research using technology. Just in the last few months, there has been a dramatic improvement in the quality of these coding assistants, both in general and, it seems, specifically for doing this kind of empirical work. And so, yeah, over the holiday break, I got super deep into using one of these tools, Claude Code specifically, to do a bunch of different kinds of work. And then I gave it, yeah, this paper to extend. What did you ask Claude to do, and how did you think about getting Claude to do this replication task? In this particular case, the way I did that was first by having a conversation with Claude in my web browser, so not Claude Code. And I'll explain how they're different in a moment. So I told Claude, I would like to see if I can extend this research paper that I wrote. And here I'm going to provide you the link to the PDF of the published paper in PNAS, as well as the link to the GitHub that contains all of the data and code. What I want you to do, Claude, is develop instructions for how Claude Code should go about extending this through to the present day. What that led to was Claude developed a whole plan, quite detailed, of where you needed to go on the internet to get the new data, how you would need to munge that data together with the GitHub repo data to make the new data set. I wanted to do this with as little intervention as possible. I'm very confident we could have produced a more accurate extension had I intervened more, but I didn't want to. I wanted to do the most basic version. And so for the replication part, it said, I'm going to do this in Python.
I'm going to translate this Stata code into Python. And in order to do that, it asked me, what Python package do you think is the best one to use to get as close as possible to the Stata package we had used to run the regressions, which was a Stata package called reghdfe? And I told it to use this particular one called pyfixest, which I think Apoorva and some other people developed. And then what it outputted is something I call an instructions.md file. So it's a markdown file that will then be ingested into Claude Code, that gives Claude Code the lay of the land. And it's very, very helpful in working with these coding agents to give them a lot of concrete information and structure. What's so interesting here is that the information and structure I gave it was itself AI generated. So then I took that plan, that file, I put it into Claude Code. And then I launched, yeah, Claude Code, which, as I said, is different from Claude because it's a coding agent. So it can gather data and save it to your computer. It can produce LaTeX files and compile them into PDFs and so forth. Can I add one thing quickly? Like, our instructions.md file that Andy mentioned is huge. Like, we don't know how sloppy and bad it would have been had we just said, please extend Thompson et al. 2020. We will, because we're writing a paper now. So now that we have the ground truth from Graham doing the extension, we're going to rerun Claude Code on the extension with many different prompts and many different contexts. And one of the things we'll try is the no-context version, which, I'm sure Graham is right, will be terrible. Did you have it just straight replicate the paper in addition to extending it? Yes. And it did it perfectly. What did it find? What did it do? What did it find? It went out and found most, but not all, of the new data. And Graham can speak to what he found in his audit of what it did. It found most, almost all, of the new treatment observations. So there were new counties that had implemented the policy since the last time we wrote the paper. And then it updated our exact diff-in-diff specifications and found something similar: basically the same effect on turnout as before and a slightly smaller estimate on partisan vote share, though still with some noise. And then it also did some new analyses, which I had encouraged it to do in my instructions but did not give as much detail on. And it did much more poorly on those. Okay, well, let's come to that, because I'm interested in this question of, like, did it come up with bad ideas? One thing it thought to do is hold out 2020. But then when it went to implement that, it held out 2024. I think one thing we had talked about is, like, context fatigue at this point; we need to dig a little deeper there. But Andy, what else are you seeing that it asked for? Like, I think that's just one. So right in the instruction file, it asked for separate estimates by period, by which it meant do one estimate that goes through 2018 and then one that goes 2020 to 2024. That kind of makes sense; it's, like, to see how much different the new data is than the old data. Then it started asking for weirder stuff. It wants a California-specific analysis, because California provides most of the new variation. That's not, like, a crazy idea, but I wouldn't have done that, because it just seems kind of like an extra thing for no reason. Then it suggested an event study model, and it didn't really explain why it wanted that or give enough detail on how to do it.
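To picture the Stata-to-Python translation Andy is describing, here is a minimal sketch, not the authors' actual code, of how a reghdfe-style diff-in-diff maps onto pyfixest. The file and column names are hypothetical placeholders; only the package choice and the county and state-by-year fixed-effects structure come from the conversation, and the clustering choice is an assumption.

```python
# Minimal sketch, not the replication code from the episode.
# Hypothetical panel: one row per county-election, with columns
# turnout (outcome), vbm (=1 once universal vote by mail is in place),
# county, state, and year.
import pandas as pd
import pyfixest as pf

df = pd.read_csv("county_panel.csv")  # hypothetical file name

# Rough pyfixest analogue of Stata's
#   reghdfe turnout vbm, absorb(county state#year) vce(cluster county)
# "state^year" interacts the two fixed effects; clustering by county
# is an assumption, not something specified in the episode.
fit = pf.feols(
    "turnout ~ vbm | county + state^year",
    data=df,
    vcov={"CRV1": "county"},
)
fit.summary()
```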
And finally, it asked for... I like to think it was worried about staggered diff-in-diff. And then it asked for robustness checks that were quite unclear. So it said: alternative specifications, different fixed effect structures, dropping 2020 to test sensitivity to the COVID election year, placebo tests, if applicable. And that's all it said. Claude Code then took that and interpreted it in, like, a bunch of weird and problematic ways. It tried to do the event study just as the instructions asked for, and it just did it wrong, as Graham's audit showed. And then the thing I took the most issue with was the robustness. It did this robustness table with a bunch of fairly arbitrary choices, which do come to some extent from those instructions that I just read you. But all of the robustness checks are done relative to this specification I would call the vanilla diff-in-diff, which is, like, the county and state-by-year fixed effects. But we had shown in the original paper that that specification seems like it has a lot of pre-trending, and we had argued for using linear or quadratic time trends to show that there's not much of an effect once you account for those time trends. Given that, it doesn't make any sense to do your robustness check table with the biased specification that you know is biased. So it was just dumb. Okay, so let's turn to the audit. So Graham, tell me about, yeah. So basically, I opened the directory and read the instructions.md file and said, it's my job to do the human extension and do these tasks myself. For context, I'm a later-stage PhD candidate at UCLA, and I've worked a lot with US elections data, and this was up my alley. And so I just dove in and got updated information on the treatment here, which is the rollout of vote by mail, and updated information on the outcome, getting election data, cleaning it, getting it into the same panel structure as the original paper's data set. And then, as Andy mentioned, that aligned really closely with what Claude did for the elections we both collected, but Claude failed to collect a few gubernatorial and Senate races in Utah and Washington. And so I got those additional elections. And then for counties with vote by mail, this was really interesting. So I told myself, don't look at the work Claude did before doing this myself, because I didn't want to be biased. And I went and just searched for every county in California and when they adopted the Voter's Choice Act, the VCA. And I got all of the same dates as Claude at first. And then, reviewing it, there was one county, Imperial County, that both Claude and I at first made the mistake of thinking was 2024 when it was actually 2025, because something on the Secretary of State's website was kind of confusing, and it looked like it had been in place by the 2024 election. So in 29 out of 30 cases, it's the same: Claude and I get the same date of adoption. But then the one other kind of interesting thing on treatment here is that we're trying to get the first date, the first election, for which there's universal vote by mail. And in the context of the original paper, it's all about the date of the VCA adoption; that determines when a county has vote by mail. But California passes this law, state bill 37, making universal vote by mail permanent from 2020 onward. So that's context that the original paper doesn't have. But after the publication of the original paper, this state law passes, guaranteeing a mail ballot.
So, in that sense, then, all of California counties are treated in 2020. But Claude misses this entirely. Claude missed that entirely. Claude never thought, like, does the VCA still determine vote by mail, which is the thing we're interested in? Claude never paused. Okay. It's an interesting question. I think if you did it again and said, we're replicating this, but, I don't know, somehow clued it in that the treatment we're interested in is vote by mail, the way we're measuring it is the VCA, as opposed to saying the treatment we're interested in is the VCA. I do think it's a perfect example of why this is not a tool that can go off on its own. At the same time, the combination of the tool and Graham's auditing produced an extension much faster than if Graham and I had tried to do it without the tool. What's the verdict on the comparison of the audit to the automated version? My read was similar. I was really, really glad we did the audit, because it added so much to our understanding of both why this is impressive from Claude, but also why it's fundamentally limited. It demonstrated a significant amount of capability to collect data from open web sources, organize that data correctly, merge it, run regressions correctly, and report the results of those regressions correctly. And that is super valuable for us, because those are things that take up a lot of our time and are not particularly fun to do. Yeah. And there are things that it couldn't do reliably even a year ago, because of hallucinations, mistakes, and other things. And so it's very exciting to see the tool progress this much, to the point where it can do that stuff. But at the same time, I think Graham's audit also showed that if you were to unleash this thing and have it write and submit a whole paper without supervision, which I guarantee you people are already doing, it would be bad. Like, it would not be a good paper. Obviously, you've shown that you can do some literal replication in a very automated way. And that's maybe bad for replication in terms of journals, but good for the discipline. Like, one of the things that I believe is, if we're going to get serious about whether or not we should believe the estimates in our literatures, we're going to need to be able to do not just, sort of, you know, rerunning people's regressions and all, but also, maybe not as extreme as running the same experiment a million times, but something more like at-scale reanalysis that lets us dig into: is there a regression in analysis D12 that actually is the specification I would have chosen given the argument? Or does it look like there were some very natural specifications that weren't run, and when you run them, the results do or don't hold up, that kind of stuff. Like, we're going to need to get to some sort of giant representation of the space of possible, sensible analyses for the idea in this paper: how often do we get a result kind of like the one that's reported? Are these tools promising there? I do. I think it's complicated. I would start by saying that I don't totally agree with your premise. I think pure, straight-up reproducibility is still a big hindrance. I think probably the majority of all papers published in poli sci journals are not reproducible, would be my guess. And I assume our journals and our professional associations will take forever to do anything about this. I've completely given up on them, basically.
But I do think serious, creative, ambitious people in the field are already, this week, starting to move to a norm of, like, the next time a working paper is posted, I won't read it if it doesn't come along with a, like, some AI verified that this is reproducible. And to your point, that is a minimal standard, but I think we're going to move to that standard pretty quickly. So I expect reproducibility verification, modulo some little issues we could get into, to be a big deal just from a behavioral norm, even if the journals and the associations are unbelievably slow to do anything about this. There's going to be a trade-off; we're in a phase where the audit's faster than writing the original paper, but it's way slower than letting the LLM do it. The audit, we think, is more right: it catches a few mistakes, gets slightly different results. But if we decide we have to human audit everything, then the productivity gains are limited. Do you think we should balance that? As scientists, should we balance that in favor of always doing the human audit? How should we? I don't know. Yeah, it was not lost on me that 10, 20 hours of my work equals one hour of Claude work. An exciting idea that I have coming out of this project is, like, a sort of living research. Like, how many times do we estimate a regression on old data, and then we re-estimate it and we get something slightly different? And so, just, I think that is a really good use case where, once we know we have everything set up correctly, it can ingest new data, clean it properly, re-estimate. But even there, in your audit, you discovered that when it ingested new data, it missed some stuff. I'm definitely mindful of that trade-off. I think that one of the constraints we're going to quickly hit is that, for this style of research that the field has been doing for a while, it just is going to turn out that the binding constraint on our productivity and the value of our insights is actually that we don't have that many of these research designs, more than that we needed more analyses of the same research designs. So we're not going to run into the constraint that we actually can't do that many analyses, because there's just not enough identifying variation. I kind of think we're going to move on from this era. There are going to be whole new ways of doing research. I'm not saying they're better or that we should abandon the research design phase, but they're not going to make us that much progress on the research-design-style work, because of the limited amount of exogenous variation in these interventions in the world. And at the same time, they're going to massively empower other kinds of research that weren't possible before. And the kinds of things I have in mind: research that updates every day. You can easily use these tools to automate that. Another thing that I think it's going to lead to is really, really massive descriptive efforts that involve digitizing enormous amounts of information, being able to characterize in extreme detail every change that's ever occurred in the US tax code or something like that. And then the thing I've been spending most of my time on recently is, it's opened up this whole new world of building little software prototypes on the fly.
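One hypothetical shape the "living research" idea mentioned above could take is a small scheduled job that re-pulls the data and re-runs a locked-in specification whenever new elections arrive. Everything below, the function, the data source, the column names, is illustrative, not anything Andy described building.

```python
# Hypothetical sketch of "living research": re-ingest the latest data,
# re-run the frozen specification, and report the updated estimate.
import pandas as pd
import pyfixest as pf

SPEC = "turnout ~ vbm | county + state^year"  # locked-in specification

def reestimate(data_url: str) -> dict:
    """Pull the latest county panel and re-run the same regression."""
    df = pd.read_csv(data_url)  # e.g. a continuously updated data feed
    fit = pf.feols(SPEC, data=df, vcov={"CRV1": "county"})
    return {
        "estimate": float(fit.coef()["vbm"]),
        "se": float(fit.se()["vbm"]),
    }

# Run on a schedule and diff against the last published numbers:
# results = reestimate("https://example.org/county_panel.csv")
```

The hard part, as the audit discussion above suggests, is not the re-estimation but trusting the ingestion step.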
And so I think there's going to be a whole other kind of research people start to do, where you build tools that are interesting, that have something to do with politics, whether that's intervening on how people use AI systems, or tools for people when they're coordinating in their political lives in the local area, at the county, state, federal level. There's just going to be a lot more of people building tools and then running experiments and user studies to see how they work, because it used to be really hard to design and build those, and now it's crazy easy. I do think it'll make our normal empirical work better for some of the reasons we've discussed, including verifying the reproducibility, doing these things around the edges to make sure they're hyper-accurate. But I also think it's going to make us all kind of feel like we've kind of hit the limits of what we were able to do with that whole approach. My hope is this is so disruptive to the journal system that we're going to have to move to new methods of review and standards that ideally aren't going to involve this ridiculous race to find p less than 0.05. But if we don't, then for sure it's going to lead to a bunch of good-looking papers that are just the result of multiple testing by the AI. Yeah, editors are going to have to figure out a strategy, right? For sure. And referees. Well, the referees are all using AI, too. The reviews are all slop, too. So, yeah, it's a tough problem. I guess I don't know. I guess I'm not as impressed as they are, in that the paper that they produced is really not an impressive paper. It's a very short paper. It's written in bullet points. Claude did make mistakes. And the fact that the estimates are similar is not so reassuring because, well, there's not a ton of new data anyway. And the estimate is pretty close to zero anyway. And so, like, attenuation bias is going to have a relatively small effect in this particular context. And so I guess I don't know how hopeful I should be in general that we can trust Claude to go collect new data and write new papers for us. To be clear, it's not just that they got the same estimates. Like, Graham's analysis and Claude's analysis are, with a couple little exceptions, identical. Like, Claude didn't just get the same numbers by accident. It did the task correctly. On that front, I came out very differently than you, Anthony, so I would like to understand a little bit why you are so skeptical. So the number one thing, I think, that we've learned from this is that we do have this ability, or if not, then we'll have it soon, to update past papers with current data, which I think is a huge thing. It might be an easy task, but no one is doing it, because there's no reward for that, and it's actually time consuming. Maybe we actually don't need to; maybe Andy doesn't need to write his first paper in the first place. All he needs is to think carefully about what's the question and what are the analyses he would like to run and all the robustness checks, and then just give it to Claude. And that makes Andy much more productive. Is Claude going to make more mistakes than Andy would make? Right now, this analysis makes me a little bit reassured that not many more. For a while, I think it's still going to be making more mistakes than a human would, for just the obvious reasons. It seems unlikely that that will always be the case. But it's... Eventually, automation. Exactly. Our planes fly with machines. Indeed.
Why should this be different? For just, you know, the thinking part, I understand, this is a different story. But once you have a plan, in equilibrium, that might lead to bad outcomes. But it's just amazing that we can gain this productivity. I mean, and Ethan, you guys talked about that in the interview, which I thought was really interesting. Like, we don't really know how well it would do with a brand new task, right? This is one where we can already say, here's a really good paper already. Here's the GitHub. Here's exactly where you find all the data. We don't know how well it would do. But let's start even just, I think Wiola's point is, like, let's start even just with that, right? Think of all the good empirical papers written in 1998 or in 2004 or whatever, where the world has kept generating data and it is in nobody's interest to go find out whether the results still hold, either from the perspective of robustness as we have more information, or has the world changed. You should actually love that from the perspective of the science, and it does it at vastly lower cost than we can do it with human labor. I think, I think, yes, I think I would love it if it does it reasonably accurately. This seems like a relatively easy case, and it did make mistakes. You know, it did. So for example, it miscoded the treatment timing for counties. It missed the thing about the state-level change in California. It just completely missed a couple elections, right? Like, there were a couple elections it just missed and didn't put in the data set. And, you know, you could argue those are the kinds of minor mistakes that humans could make as well. But I mean, it's the kind of minor mistakes that humans make all the time. I've read your referee reports. But yes. So I guess the question is, is this better than nothing? And the answer is probably yes. But what if it's the case that it does make a lot of errors like this? Suppose we do this for a year on, you know, 100, 200 papers, and we audit a bunch of them and we discover that, like, 96% of the time it's getting things right and we can in fact automate updating a huge amount of our empirical work. Sure, no, that's right. That'd be cool. If it makes enough errors that we only really trust it once we've audited it, then it doesn't really... Yeah, then it doesn't get us anywhere, 100%. If these AI agents are just going to go, like, kind of descriptively learn every detail about every institutional change, how the law evolved, how institutions evolved, whatever, and they could also wrap their neural nets around, their minds around, what we mean by a good research design, they might be able to suss out for us all sorts of identifying variation. There may be a bazillion little natural experiments out there from law changes, institution changes, whatever, that we haven't accidentally stumbled across, and that it'll alert us to. And maybe, like, Andy kind of has this theory that we're going to run out of identifying variation very quickly, but I kind of think it's possible that we're going to end up able to discover all sorts of identifying variation that we didn't know about, because none of us are able to absorb the corpus of law changes and institution changes and whatever, you know. But if it could learn, right, sort of what the basic structure of a regression discontinuity is, it could just go out into the world and read everything and see what little changes were made that created an RD.
And then, once it told us about it, we could talk with it about what regressions to run, or what the questions are, and think. But, like, it might be able to discover a whole bunch of identifying variation that we don't know about. I worry that's the kind of thing it's not good at. Maybe it will get better, but, I mean, and as Andy talked about in the interview, he said the kind of thing it's getting really good at is writing code, which is interesting. Although usually that's not a limiting factor for most of us in our research. Like, usually we're not writing overly complicated code in our research projects. Collecting data is very tedious and time-consuming. But maybe it'll learn other things. Like, maybe it'll just go read the whole history of the legal code and it'll be like, hey, I don't know if you know about this, but they created this rule and it had a score and there was a cutoff. That could be. That could be. And that would be cool. And then you'd have a source of variation that you didn't have before, and you could think about what questions you could answer with it. It also poses new challenges. Like, this kind of software makes p-hacking so much easier, right? And so that's a whole host of problems we've thought about before, but this could just ramp that up to a completely new level. That seems certain. It already has, for sure. That seems 100% certain. Can we talk a little bit systematically about this question? So let's assume that it is a productivity shock, which means that there will be many more good papers, but also many more crappy papers. So there will be many more papers overall coming from exactly the same pool of researchers. So usually, how do we deal with that? We have a refereeing process where, you know, we don't really run the regressions and so on, but we do read the paper carefully and we try to see, you know, does the specification make sense? Is there a robustness check? Do I believe, you know, that the story somehow holds up? Is it a careful paper? My first concern, I think, is: how do we do that? We don't have more researchers. We have the same pool of researchers. Sure, they are much more productive, so they have more time on their hands, but we have no real incentives in the system to have them use this extra time for refereeing. I don't know. What are your thoughts on that? I think there's a really first-order risk that this breaks the journal system. For free, you can generate plausible-looking slop, and it's very, very difficult to separate the slop from the real thing, because it's going to be just really difficult to do, or we're just not going to have the labor force to do that, I think. So for example, like we talked about with Andy and Graham, the idea that, with exactly the methods Andy just used, you could get Claude, you could say, like, I want a paper with statistical significance and, like, a plausible-sounding research design that shows that term limits are bad for blah, blah, blah. If you learned a little bit of how to work with Claude, you could make that happen for you this afternoon. And then Claude would write you a paper, and you'd have to spend a little bit of time, you could do it this week, and submit it to a journal. And the problem is we can now generate a bazillion of those, right?
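A toy simulation, ours rather than anything from the episode, illustrates why that arms race is so lopsided: even on pure noise, a mechanical search over arbitrary specification choices reliably turns up "significant" results.

```python
# Toy multiple-testing demo: treatment and outcome are unrelated by
# construction, yet a search over arbitrary sample restrictions still
# yields p < 0.05 "findings" at roughly the expected 5% rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_specs = 500, 100
hits = 0
for _ in range(n_specs):
    x = rng.normal(size=n)        # fake "treatment"
    y = rng.normal(size=n)        # fake outcome, independent of x
    keep = rng.random(n) < 0.8    # an arbitrary per-spec sample restriction
    res = stats.linregress(x[keep], y[keep])
    hits += res.pvalue < 0.05
print(f"{hits} of {n_specs} null specifications came out 'significant'")
```

An author who reports only the hits gets a publishable-looking result for free; a referee who sees one specification has no way to know a hundred were tried.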
And that paper looks just like a current paper, and refereeing a current paper is incredibly time consuming if you're going to actually, like... So how are you going to know that that's slop, or p-hacked to death, or whatever? Like, you're going to have to do some extensive robustness examination, or go collect the data yourself, or whatever. Like, there's a lot to it, right? But we don't do that right now, and referees don't do it right now. But right now there are costs associated: you've got to go collect that data, you've got to, whatever. Like, it's already bad, and this is going to make it free. Like, you can just say, here's the conclusion I want. Here's the statistical significance level. Go find me some. Right. And so I think we're in this bad arms race between referees and slop-generating authors that massively favors the slop-generating authors. And in some sense, when I think about the current system, you know, there are sort of two institutions that are trying to differentiate among papers and that rely on reputation building. One institution is journals, but another institution is academic departments. I think the departments that decide to uphold reputation and be careful about whom they hire, I think they are already doing a better job than the journals are. Yes, maybe because, you know, the reputation is more persistent, not like journals, where you become an editor and then you step down, so there's less reason to build reputation. So should we be expecting that that's the way we are going to go? And journals will, you know... We already have disciplines, we can name them, where it seems like you pick up a paper at the top journal at random and it's random: it could be great, it could be not great. So do we expect that this is what's going to happen, that actually journals are going to be less discriminating? I do think it could kill journals. I do think journals are going to have a hard time fulfilling their role as certification mechanisms. I also think, if Andy's prediction is right that we're going to want more updatable, real-timey papers, that doesn't sit comfortably with a journal. Even if, I mean, there is a real dilemma just in terms of the volume, right? Even if it turns out, even if good referees and editors can distinguish between good and bad papers. And there are some kinds of things that you can catch and there are some kinds of things you can't catch, right? Like, when you're refereeing a paper, you don't go independently collect the data to see if it made mistakes, right? But even if you are good at sorting out, identifying the slop, even if it's just that the number of submissions to journals increases fivefold, journals can't do their job. Who is going to referee all those papers? Yeah. Given that we're going to have many, many more submissions. And so even if we abandon journals, it's not obvious yet how we sort out the good and the bad research and learn and do a better job of figuring out what's reliable and what's not. Can I make a prediction? Please. It's worth a shot. Let's see. Journals will still be there, but they will be just much less discriminating. And I think there will be departments that will decide, hey, we still want to be around smart people and have smart conversations, and we only want to have papers around us that we really, truly trust. And those departments will do a careful job. When they hire people, they will read their papers, they will check their papers.
And then there will be departments that will say, well, you know, it's hard, we are not doing that, we are just going to hire people who happen to publish in those journals that we used to consider as top five or top three. And I think that that's going to happen; there will be, like, this bifurcation. And the main negative that I see in that is that for the outside observer, it's going to be very hard to distinguish what's good and what's bad research. You know, it's going to come from a department, and we have no way to really certify that we are the good department versus the bad department. And I think there will be a lot of research making its way to the New York Times or, you know, just social media, and there will be just no way for the outside world to distinguish what is what. I mean, this problem is already here to some extent. Like, this is not what I said. I, you know, I'm less hopeful than you are that there are going to be these great departments that really work hard to distinguish on quality. This is going to substantially lower the costs of doing empirical social science research. That is going to invite, for those cases where there's a kind of engineering application, where there's something in the world that hangs on what the answer is, political professionals, corporations, et cetera, to do much more, because it will be way easier for them to do kind of applied social science in a way that does not care at the level we do about how precise the answer is. So I'm not saying it's going to be amazing social science, but there's going to be, I would guess, a ton of it. And the proof will be in the pudding in the sense that they will use it for actual decision-making and applications and trying to do things in the world in a kind of very engineering way. And that will create opportunities for a lot of smart social scientists to do things that are not kind of pure basic science. And we'll see a lot of that move; well-trained people in the social sciences will have opportunities, because they'll be able to generate work so much faster, so much more productively, in politics or, more likely, in the private sector. And that will be an equilibrium that's bad for the discipline of political science but maybe good for how much we know about the social world. At least, I don't know how it'll aggregate up, but, like, good for how much, collectively, humanity knows, in a dispersed way, about the social world. I don't know. I kind of think this is a technological shock that is maybe going to allow social science to do a lot more engineering than we ever have been able to do before, in ways that maybe are bad. What we learn about is going to be, like, how to maximize ad revenue. Yeah, no, not things that are interesting to me. I agree. I said bad for academia. It's not completely true. We do have foundations that are interested, let's say, in fighting crime. And they do already do exactly that. They hire social scientists who do this research that they would be doing in their academic departments. But they are incentivized on getting it right, because then money flows, and then things are retested. So I don't think it's going to be completely uninteresting and geared towards private benefit. There's always a challenge for the important social science questions that we don't get enough feedback from the world. Like, we don't know if the conclusions of our paper are right.
We don't even know if the policy that was implemented is in fact effective or not. You could just go on and be wrong for decades and nobody will know. And so you can't... Thank God for that. You can figure out pretty quickly whether or not the little thing you tried for the social media company helps them make more money. And so we're probably better at those kinds of things than we are at the actually interesting questions. But there's not really a market force that's forcing the social scientists to eventually converge on the right answer. But nevertheless, that's all the more reason why we want really smart people working on these hard, important questions. Both of your predictions are pretty troubling, but plausible, I think, highly plausible. On the other hand, we have these tools that are a productivity shock to, one, our ability to collect data, write code, et cetera, and also a productivity shock to our ability to do things like audit other people's papers and think about p-hacking and specification searching and how much we should believe. I think there's two sides to the coin. But I do worry. And I think a part of the interview that I found very interesting, I think Andy made some very good points, which is we will lose something in the process. Like, he talked about people who are already just using fancy R packages and they don't even check to see if their data merged correctly, which is kind of a horrifying thought. But I'm sure he's absolutely right that this is very common, that people don't really think about what they're doing. And AI will make this even worse. They won't even think about what's really happening under the hood. They'll just give their commands to the AI and then they'll get the results. And they'll have thought even less about, like... This is a very interesting feature of this technology, right? Because, like, Andy's point, which seems obviously right to me, is that this technology is a massive complement to being high-skilled at thinking conceptually as a social scientist, thinking in a sophisticated way about research design and econometrics, et cetera. And that you can do faster, better work with this tool if you're that kind of person. But it also is a mimic of that in a way that's very hard to detect. And so it will simultaneously make the most serious people better, more productive, able to do more work. It may disincentivize acquiring those skills. And that is, I think, an unusual feature of this technology. That brings me to students. I have to ask you guys, are we teaching our students this? Is it that right now all our PhD students in their first year get to replicate a paper using AI, or do what Andy did? Are we doing that? Because it seems like no matter how we feel about this technology, this is a skill we should be teaching them. And we should be teaching them, as opposed to having them figure it out on their own, because I think the latter is more likely to lead to them using this mindlessly, while if we teach them and tell them about the ethics of research and all this stuff, then we're more likely to have it used the way we want it to be used. I think it's not obvious that we should do that. And maybe we should, because maybe our graduate students will be just ill-equipped for the future if they are not really good at using Claude Code.
But it's possible, for the reasons that we were talking about earlier, that that actually harms your ability to really think through a problem. Like, collecting data yourself and writing the code yourself and forcing yourself to think through things: you learn things from the process that you wouldn't learn if you just gave instructions to an LLM, things that might make you a better social scientist. Maybe the right model is we teach you how to do stuff without any LLMs for a long time, and then we teach you... I don't know, I have no idea. I agree with you that we have to teach them both, but I feel like... I think it's unlikely that the way it goes is, like... I mean, things get lost with technological change, and also you adapt, right? And so maybe it starts with, like, we teach them the old way and then the new way and whatever. But they will figure out, they will end up cutting out lots of parts of the old way, in the same way that, like, I'm sure something was lost when we moved from doing engineering simulations in wind tunnels to engineering simulations on computers. But also, if this is actually worse for social science, we may never learn that, right? Now we're turning out more and more papers that are worse and worse, and we're learning less and less. I learned that vote by mail, still, with a little bit of extra data, has a slightly positive effect on turnout and otherwise has very little effect on partisan vote shares. We kind of already knew that, but we add a little bit of data, which is nice. And so that's good. I guess I have two thoughts. One is, I do think this is a very interesting time to be alive. I'm trying to be forward-looking and optimistic and think about the possibilities. But it is also head-spinning. Like, it feels like things are changing so quickly, and it is hard to think about the possibilities. And so I think I'm really glad that there's folks out there trying to think creatively, not just being conservative about it, trying to think about how we can use these new tools in ways that are generative, contribute to our intellectual agenda, but not in a Pollyanna-ish way. From my interviews, certainly with Andy and Graham, I did not get the impression that what they wanted to say was, like, this was an unequivocal success, or this is an obvious win, or whatever. They just want to say, this is a thing we've wanted to be able to do. We tried to do it. Here's what we got. We're going to take seriously auditing it. I thought that was good. I think it's good that they're trying to be creative and thoughtful about what these tools can do for us. And if any of our listeners can figure out how I can make ChatGPT like me again, then please email me. Maybe you just need to switch to Claude, give up on ChatGPT. Maybe. Claude is a she. OpenAI is losing the AI race, it turns out, anyway. So I think maybe just switch to Claude. Do you worry that smart people like Andy... there aren't that many really smart, productive political scientists like Andy. Do you worry about people like that spending too much of their time on Claude and just not writing interesting, important papers? I mean, is that a real trade-off, that the smart social scientists of our day, rather than answering important social science questions, are playing around with new technology? And maybe that'll turn out to be really great.
I mean, some of them are spending time at primary schools, secondary schools. You know, it's not about AI. But you see what I'm saying? Like, is it possible that it's a distraction, and in the absence of it, there's that many more interesting papers we could have written, but instead we got kind of distracted by the shiny new technology? I don't know. I don't worry about that very much. You have to have some really top-notch people who understand what the really good research is playing around with the new technology if you're ever going to find out whether the new technology is useful for the top-notch research. I do think they're complements in that way. And the whole point of being an academic is you get to do whatever you want. And so it's obvious that Andy loves thinking about the technology-research intersection, and so I'm glad he's doing something that he thinks is fun. Hey, if you're getting a lot out of the research that we discuss on this show, there's another University of Chicago podcast network show that you should check out. It's called Big Brains. Big Brains brings you the engaging stories behind the pioneering research and pivotal breakthroughs reshaping our world. Change how you see the world through research and keep up with the latest academic thinking with Big Brains, part of the University of Chicago Podcast Network. Thanks for listening to Not Another Politics Podcast. Our show is a product of the Harris School of Public Policy and the University of Chicago Podcast Network. It's produced by Leah Ciesrin. If you like what you heard, please leave us a review. If you don't like what you heard, please send an email to one of my co-hosts. Thanks again for listening.