Introducing CongressData and Correlates of State Policy

Data Descriptor | Open access | Published: 10 July 2025

Matt Grossmann¹, Caleb Lucas² & Benjamin Yoel¹

Scientific Data volume 12, Article number: 1185 (2025)

Subjects: Economics, Politics

Abstract

Social science research into policymaking, electoral processes, and governmental functions is heavily influenced by the availability and usability of specific, granular data. Though vast amounts of data are generated by government and collected by researchers, their utilization in academia is limited due to lack of availability, aggregation, and standardization. This paper introduces two resources that address this issue using a similar toolbox: CongressData and an updated version of the Correlates of State Policy database, which tackle these challenges at the federal and state levels in the United States. We describe our methodologies for data collection, standardization, and integration, and present tools that we designed to simplify the use of the datasets and their application in research. Consistent with our commitment to collaborative scientific advancement, these tools also automate the citation of sources from the datasets that researchers employ in their studies.

Background & Summary

Scholars of state- and federal-level politics and policy in the United States consistently face the challenge of aggregating and standardizing information from different units of observation and distinct studies, thereby limiting empirical understanding, theoretical comparison, and interdisciplinary knowledge accumulation. Scholars tend to focus on either relatively short periods in congressional politics [e.g., 1,2,3] and state politics [e.g., 4,5] or only on a subset of cases or geographic areas.
In this paper, we describe two new databases with associated toolboxes: (1) a compilation of state-year data on US states, the Correlates of State Policy (CSPP), that covers over 3,000 unique variables for the period between 1900 and 2020 [5], and (2) a database on members of the U.S. Congress, CongressData, covering over 1,000 unique variables between the years 1789 and 2024. We aim to aid research in political science and to make it easier for scholars outside the discipline who are interested in the causes and effects of socio-economic, political, or policy differences to quickly integrate the accumulated data from the discipline.

These databases address the needs of researchers seeking to assess short- and long-term trends in Congress and state politics in the United States. But they also contain long-ranging and geographically widespread data that should be relevant throughout the social sciences. In essence, both improve access to fine-grained data on governance as well as district and state characteristics by merging, cleaning, and aggregating data from a range of sources and ensuring they share identical units of analysis. For instance, scholars interested in the determinants of sponsorship and co-sponsorship of bills in Congress [e.g., 6,7] can use these data, as can scholars interested in understanding the determinants of policy liberalism in the American states [e.g., 8,9]. But so could researchers interested in the effects of local trade liberalization on elections or in whether state gun policies deter crime. Diverse uses benefit not only from data availability but also from our efforts to aggregate coding information and match citation standards. We introduce associated R packages and web applications that allow researchers and policymakers to easily access, explore, and download the databases and their codebooks.

Methods

In this section we describe how we constructed our databases on Congress and US states.
Put simply, the focus of our Congressional dataset is members of the United States House of Representatives between 1789 and 2024, with the unit of analysis being member-years covering all districts within that timeframe. We define member-years as years in which a representative served for more than three months of a calendar year. The dataset offers a one-stop shop for accurate data on the US Congress, bringing together information on areas such as district demographics, member characteristics, and committee membership. This means data on individual representatives are tied to data on their geographic districts as well as their institutional behavior.

Similarly, to better understand variation in outcomes in the American states over time, including in environmental and education policies, there is a need for time-varying data at the state level. Thus, we constructed the Correlates of State Policy Project database using state-years as our unit of analysis, including all 50 US states and Washington, DC between 1900 and 2020. We anticipate that these standardized formats will offer the maximum public good to researchers and policymakers.

Across both datasets, we merge the data without making changes to the information that they contain. However, we do conduct basic verifications of candidate datasets to ensure they meet typical standards that empirical social scientists employ when conducting quantitative analysis. Since our focus is overwhelmingly on data collected by governments or used in articles published by academic journals, our simple assessments of candidate datasets rarely reveal issues of data integrity. However, to ensure reliability, we perform several checks regardless. These include assessing the number of unique values in each variable, quantifying the number of missing values, and detecting outliers.
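As a rough illustration of the three checks just listed, the following Python sketch summarizes a single variable. It is not the authors' code: the function name `summarize_variable`, the z-score rule for outliers, and the threshold value are all assumptions made for this example, since the text does not say how outliers are detected.

```python
from statistics import mean, stdev


def summarize_variable(values, z_threshold=3.0):
    """Summarize one column: unique values, missing values, and outliers.

    Missing values are represented as None. The z-score outlier rule is an
    assumption for illustration; the source does not specify a method.
    """
    present = [v for v in values if v is not None]
    summary = {
        "n_unique": len(set(present)),
        "n_missing": len(values) - len(present),
        "outliers": [],
    }
    # Only numeric values can be screened with a z-score.
    numeric = [v for v in present if isinstance(v, (int, float))]
    if len(numeric) >= 3:
        mu, sigma = mean(numeric), stdev(numeric)
        if sigma > 0:
            summary["outliers"] = [
                v for v in numeric if abs(v - mu) / sigma > z_threshold
            ]
    return summary


# Invented toy column with one missing entry and one implausible value.
column = [10, 11, 9, 10, None, 12, 10, 11, 9, 10, 11, 500]
report = summarize_variable(column, z_threshold=2.5)
print(report)
```

A flagged value is only a prompt for manual inspection of the original source; nothing in the sketch alters the data.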
These checks are not intended to verify the correctness of the data, but to identify potential issues that arose when the data were originally collected or cleaned. Moreover, we ensure that each variable that we include is documented with its meaning and origin so that the information can be used and cited by researchers leveraging our tools. Overall, our approach prioritizes transparency and reproducibility, as all variables remain linked to their original sources and are accompanied by metadata detailing their source. If users identify any issues with the data, they are able to email our project team directly at CorrelatesofStatePolicyProject@gmail.com.

CongressData

CongressData builds on previous work regarding the US Congress. The principal components draw on data related to congressional committee membership [10], legislative effectiveness [11,12], district demographics [1,13], congressional member characteristics [1,13], and bill introduction [14]. The dataset also includes original data that we collected, such as the proportion of bills introduced on cultural and economic issues in a given district-year, and has planned future additions that will increase its size and scope. After sourcing data from replication archives and other sources, we standardize and merge them into the dataset and reflect their inclusion in our codebook (available as a CSV, PDF, or searchable in the web applications) by adding information describing each variable and its original source.

To achieve this, we needed common keys across the different datasets. We relied on a range of existing identifiers, most notably the Inter-University Consortium for Political and Social Research (ICPSR) numerical codes for Members of Congress, the Biographical Directory of the United States Congress (Bioguide) ID, the district number, the congressional session, the name of the state, and the year.
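A key-based merge of this kind can be sketched in plain Python as a left join that keeps every row of the base panel and attaches source columns where the keys match. This is only a sketch under stated assumptions: the function `merge_on_keys`, the column names (`bioguide`, `congress`, `score`), and the IDs are hypothetical and do not come from the datasets themselves.

```python
def merge_on_keys(base_rows, source_rows, keys):
    """Left-join source_rows onto base_rows using the given key columns.

    Every base row is kept. Source columns are filled with None when no
    source row shares the same key values.
    """
    # Index the source rows by their key tuple for constant-time lookup.
    index = {tuple(row[k] for k in keys): row for row in source_rows}
    # Collect the non-key columns the source contributes.
    extra_cols = sorted(
        {col for row in source_rows for col in row if col not in keys}
    )
    merged = []
    for row in base_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        combined = dict(row)
        for col in extra_cols:
            combined[col] = match.get(col)
        merged.append(combined)
    return merged


# Hypothetical member panel and a hypothetical source keyed the same way.
panel = [
    {"bioguide": "X000001", "congress": 110, "state": "Michigan"},
    {"bioguide": "X000002", "congress": 110, "state": "Indiana"},
]
source = [{"bioguide": "X000001", "congress": 110, "score": 1.2}]

merged = merge_on_keys(panel, source, keys=("bioguide", "congress"))
print(merged)
```

Starting from the panel and joining leftward mirrors the idea of establishing the population of rows first and then attaching characteristics, so an unmatched source never removes a row.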
For instance, to merge data on legislative effectiveness in the House of Representatives from [11,12], we relied on the fact that they include the congressional session number and Bioguide ID. After finding these common keys, we were able to merge the different datasets. This yields a database with four main types of variables on congressional politics: the bills data, district demographics, congressional member characteristics, and all other variables. While redistricting can complicate some large-scale data efforts, our focus on member-years both keeps the dataset broadly useful and resolves concerns regarding how to handle changes in district maps over time. We first establish the population of member-years, then merge on characteristics associated with those individuals and the geography that they represent.

Our database on the US Congress includes information on over 1,000 distinct indicators across all districts spanning the period between 1789 and 2024. Because the dataset provides expansive data on both historical and contemporary congressional politics, scholars can use it to re-evaluate existing theories and test competing ones. Moreover, by offering all the variables at the same unit of analysis, we enable scholars to easily utilize data on a range of variables relating to the US Congress, either as their main variables of interest or as control variables in their analyses. To illustrate the breadth of information that the dataset offers to enable research in this domain, we visualize the distribution of its variables across categories in Fig. 1.

Fig. 1 CongressData: Variables by Category. The number of variables in CongressData across categories.

Correlates of State Policy

We also introduce an updated version of the Correlates of State Policy Project (CSPP), which was first introduced in [5].
The foundation for the Correlates of State Policy database includes information from hundreds of sources on state government [e.g., 15,16,17], healthcare [16,18], education policies [19,20], criminal justice [21], civil rights and liberties [8,9], and economic indicators [17,22]. As with CongressData, we bring together data from a range of academic and non-academic sources and merge them into a single dataset, ensuring that they are all at the same unit of analysis, namely state-year.

The current version of the database on the American states includes information on over 3,000 distinct indicators across all 50 states and DC spanning the period between 1900 and 2020. This represents an increase of approximately 1,000 variables compared to the earlier version of the database. We use the state names or abbreviations and years to merge across the different sources of information. This process was made possible thanks to the generous contributions of researchers across numerous academic fields, policy centers, and think tanks that have made data available for public use. We present a suite of tools for both datasets that generate citations for variables that users extract, ensuring that the original sources are credited for their contributions and simplifying the process of collecting references.

We distinguish policy indicators based on whether they relate to economic policy, government policy, elections, public opinion, criminal justice, education, healthcare, welfare, rights and civil liberties, environmental policy, drug and alcohol policy, gun control, labor, transportation, or regulatory policy. Initially, the Correlates of State Policy featured approximately 2,000 unique variables [5]. Since then, we have added over 1,000 new variables, including on public sector unions, state domestic violence firearm laws, foreign direct investment, and the diffusion of redistributive policy across the US states. We illustrate these additions and the dataset’s overall substance in Fig.
2 below, which displays the number of variables across categories in the original release and in this updated version. In particular, the new version of CSPP has an extensive number of new variables on economic and labor policy, governance, social issues, and the environment. The breadth of information will enable new studies that relate to the characteristics of US states.

Fig. 2 CongressData: Variables by Category. The number of variables in CongressData across categories.

Finally, to illustrate the application of these datasets to time-varying studies within American politics, we provide an overview of each dataset’s coverage and scope. We visualize their cumulative number of variables across time in Figs. 3 and 4 below. As these figures indicate, both databases record information across the entire period that they cover, but many of the variables in each are relatively limited in time and focus on more recent periods, particularly since the 1970s, which makes sense given recent advancements in digitization and record keeping. Our intention is to continue adding to both databases to provide increased breadth and additional fidelity, as evidenced by the updated version of Correlates that we present in this manuscript.

Fig. 3 CongressData: Cumulative Number of Available Variables. The number of variables available in CongressData over time.

Fig. 4 Correlates: Cumulative Number of Available Variables. The number of variables available in Correlates over time.

Data Records

CongressData is available in a single dataset that records member-years from 1789 to 2024 [23]. It offers a number of common member identification variables (Bioguide, ICPSR, House History, etc.) along with the exact dates of the member’s term that is associated with each row of data. Correlates is available in a single dataset that records state-years from 1900 to 2020.
Beyond state names and abbreviations, it provides FIPS and state codes to enable connecting to external datasets [23]. Both datasets are documented in associated codebooks, available as PDF and CSV files.

Usage Notes

The complete CongressData and Correlates of State Policy databases are available for download in four locations. First, we host the datasets described in this publication on figshare [23]. Second, we maintain a homepage for the Correlates data on the website of the Institute for Public Policy and Social Research at Michigan State University: http://ippsr.msu.edu/public-policy/correlates-state-policy. Third, we developed R packages for both datasets, each of which is published on CRAN, that offer a range of useful functionalities and direct access to the data: cspp and csppData for the Correlates of State Policy Project, and CongressData for our Congress database. Fourth, we created interactive web applications for each of the databases: CSPP is available at https://cspp.ippsr.msu.edu and CongressData at https://cspp.ippsr.msu.edu/congress/. The applications allow users to interactively filter, visualize, and download tailored portions of the data with an auto-generated codebook, full variable descriptions, and associated citations.

Technical Validation

As discussed above, our approach to both datasets is to establish a reliable panel on which to merge useful information from a range of cited sources. To ensure the completeness of CongressData, we also compared the identification variables against authoritative sources. Specifically, we collected the Bioguide IDs from congress.gov and cross-referenced those with the IDs in CongressData. This resulted in a 99.7% match, with the six members not represented all having served less than the required 90 days for inclusion in our dataset.
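The cross-referencing step described above amounts to a set comparison between two lists of identifiers. The Python sketch below shows one way to compute a match rate and list the unmatched IDs. The function name `cross_reference` and the IDs are invented for illustration; it does not reproduce the authors' procedure or results.

```python
def cross_reference(reference_ids, dataset_ids):
    """Return the share of reference IDs found in the dataset, plus the misses."""
    reference, dataset = set(reference_ids), set(dataset_ids)
    missing = sorted(reference - dataset)
    match_rate = len(reference & dataset) / len(reference) if reference else 0.0
    return match_rate, missing


# Invented IDs standing in for an authoritative list and a dataset column.
authoritative = ["A000001", "B000002", "C000003", "D000004"]
in_dataset = ["A000001", "B000002", "D000004", "A000001"]

rate, missing = cross_reference(authoritative, in_dataset)
print(f"{rate:.1%} matched; missing: {missing}")
```

Converting to sets first means repeated IDs, which are expected when each member appears in many member-year rows, do not inflate the match rate.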
The six members not represented are: Curson/C001089, Hall/H001092, Jones/J000303, Lee Carter/L000605, Sekula-Gibbs/S001166, and Wied/W000829.

To illustrate the implication of that result and the benefit of the multiple identification variables present in our dataset, we attempt to match member names across CongressData and the Legislative Effectiveness data [11,12]. We selected this dataset because it makes up a large number of variables in CongressData and has been utilized extensively in the literature on congressional politics to understand the causes and consequences of legislative behavior. For CongressData, we paste the firstname and lastname columns together, and for the Legislative Effectiveness data we perform simple string operations to convert the thomas_name provided by the dataset, in the form Last Name, First Name, to match CongressData’s format.

Comparing the names between the two datasets reveals an approximately 84% match without additional effort. Of course, we used a simple procedure and could refine the approach or individually match the 263 names that were problematic, but we use this basic analysis to illustrate the benefit of providing names along with a range of ID variables in CongressData: the Inter-University Consortium for Political and Social Research (ICPSR) numerical codes for Members of Congress, the Biographical Directory of the United States Congress (Bioguide) ID, the House History ID, Wikidata, and Google Entity ID. The naïve name approach failed in cases such as ‘Dave Weldon’ versus ‘David Weldon’ and ‘Enid Greene Waldholtz’ versus ‘Enid Waldholtz.’

Code availability

Replication materials for this manuscript are available on Harvard Dataverse at https://doi.org/10.7910/DVN/EUUCRL. The datasets are available on figshare at https://doi.org/10.6084/m9.figshare.28146914.

References

1. Hunt, C. R. Home field advantage: Roots, reelection, and representation in the modern Congress (University of Michigan Press, 2022).
2. Binder, S. The dysfunctional congress. Annual Review of Political Science 18, 85–101 (2015).
3. Polsby, N. W. & Schickler, E. Landmarks in the study of congress since 1945. Annual Review of Political Science 5, 333–367 (2002).
4. Carsey, T. M., Niemi, R. G., Berry, W. D., Powell, L. W. & Snyder, J. M. State legislative elections, 1967–2003: announcing the completion of a cleaned and updated dataset. State Politics & Policy Quarterly 8, 430–443 (2008).
5. Grossmann, M., Jordan, M. P. & McCrain, J. The correlates of state policy and the structure of state panel data. State Politics & Policy Quarterly 21, 430–450 (2021).
6. Bernhard, W. & Sulkin, T. Commitment and consequences: Reneging on cosponsorship pledges in the US House. Legislative Studies Quarterly 38, 461–487 (2013).
7. Kessler, D. & Krehbiel, K. Dynamics of cosponsorship. American Political Science Review 90, 555–566 (1996).
8. Caughey, D. & Warshaw, C. Policy preferences and policy change: Dynamic responsiveness in the American states, 1936–2014. American Political Science Review 112, 249–266 (2018).
9. Caughey, D. & Warshaw, C. The dynamics of state policy liberalism, 1936–2014. American Journal of Political Science 60, 899–913 (2016).
10. Stewart, C. I. & Woon, J. Congressional committee assignments, 103rd to 114th congresses, 1993–2017: House of representatives (2017).
11. Volden, C. & Wiseman, A. E. Legislative effectiveness in the United States congress: The lawmakers (Cambridge University Press, 2014).
12. Harbridge-Yong, L., Volden, C. & Wiseman, A. E. The bipartisan path to effective lawmaking. The Journal of Politics 85, 1048–1063 (2023).
13. Foster-Molina, E. Historical congressional legislation and district demographics 1972-2014. Dataset https://doi.org/10.7910/DVN/CI2EPI (2017).
14. Adler, E. S. & Wilkerson, J. D. Congress and the politics of problem solving (Cambridge University Press, 2013).
15. Boehmke, F. J. et al. SPID: A new database for inferring public policy innovativeness and diffusion networks. Policy Studies Journal 48, 517–545 (2020).
16. Boehmke, F. J. & Skinner, P. State policy innovativeness revisited. State Politics & Policy Quarterly 12, 303–329 (2012).
17. Klarner, C. Governors dataset. Dataset https://doi.org/10.7910/DVN/RYY3OW (2013).
18. Sorens, J., Muedini, F. & Ruger, W. P. State and local public policies in 2006: A new database. State Politics & Policy Quarterly 8, 309–26 (2008).
19. Lacy, T. A. & Tandberg, D. A. Rethinking policy diffusion: The interstate spread of “finance innovations”. Research in Higher Education 55, 627–649 (2014).
20. Lyon, M. A. Heroes, villains, or something in between? How “right to work” policies affect teachers, students, and education policymaking. Economics of Education Review 82, 102105 (2021).
21. Boushey, G. Targeted for diffusion? How the use and acceptance of stereotypes shape the diffusion of criminal justice policy innovations in the American states. American Political Science Review 110, 198–214 (2016).
22. Klarner, C. State economic data. Dataset https://doi.org/10.7910/DVN/KMWN7N (2013).
23. Grossmann, M., Lucas, C. & Yoel, B. Introducing CongressData and Correlates of State Policy. https://doi.org/10.6084/m9.figshare.28146914 (2025).

Author information

Authors and Affiliations

1. Michigan State University, East Lansing, US: Matt Grossmann & Benjamin Yoel
2. Indiana University–Bloomington, Bloomington, US: Caleb Lucas

Contributions

M.G., C.L., and B.Y. drafted the manuscript. M.G. managed the data collection process. B.Y. collected and cleaned data. C.L. cleaned data and created the web applications and R packages.

Corresponding author

Correspondence to Matt Grossmann.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.