Advanced Web Data Collection Workshop

Syllabus 2026

Author

Cornelius Erfort (Witten/Herdecke University)

Published

April 16, 2026

Prepared for: Automated Web Data Collection 2026

Course repo: https://github.com/cornelius-erfort/automated-web-data-collection-2026

Rendered syllabus (HTML): https://cornelius-erfort.github.io/automated-web-data-collection-2026/index.html

Course Overview

The internet is an essential source of data for social science research, providing access to vast amounts of text and structured information. This course introduces students to both basic and advanced methods for automated web data collection, focusing on practical applications in political science and other social sciences. In addition to classic approaches (scraping static and dynamic content, working with APIs, and processing multiple data formats), we will cover how recent advances in AI and agentic coding can accelerate data collection: assistants can now often draft working scrapers quickly, iterate on errors, and, when combined with browser developer tools, help discover the underlying requests a page makes (e.g., for pagination). Agentic coding can take over large parts of the scraping pipeline, but a basic understanding of the underlying processes is still valuable for supervising and verifying its output. The course also covers browser automation, error handling, scheduling scraping jobs, and ethical and legal considerations.

Learning Objectives

By the end of this workshop, participants will be able to:

  • Understand web scraping fundamentals and best practices
  • Understand the role AI agents can play in scraping workflows (and how to supervise/verify them)
  • Use core R libraries for web scraping (rvest + read_html_live()/chromote, httr)
  • Handle different types of web content (static, dynamic, JavaScript-rendered)
  • Work with APIs and direct HTTP requests
  • Store and process scraped data in various formats (CSV, JSON, XML, etc.)
  • Implement rate limiting, respect robots.txt, and manage sessions/cookies
  • Automate scraping tasks and schedule jobs (cron, Docker)
  • Handle common scraping challenges, errors, and logging
  • Understand and apply ethical and legal considerations in web data collection
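As a rough illustration of the objective on rate limiting and robots.txt, the following minimal sketch checks URLs against a robots.txt file and enforces a pause between requests. The course itself works in R (rvest, httr); this sketch uses only the Python standard library so that it runs without a network connection. The robots.txt content, the user agent name `course-bot`, and the helper names `build_parser`, `allowed_urls`, and `RateLimiter` are assumptions made for illustration and are not part of the course materials.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
# A real scraper would download the file from the target site instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


class RateLimiter:
    """Enforce a minimum pause (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep    # without actually waiting
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

In a scraping loop, one would call `limiter.wait()` immediately before each request, with `min_interval` taken from `parser.crawl_delay(user_agent)` when the site declares one and from a conservative default otherwise.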

Course Schedule

Slides: Day 1, Day 2

Day 1: Fundamentals and Basic Scraping

Time              Topic
9:00              Start, Introduction
                  Course overview, setup
                  HTML/CSS, Web Structure
                  HTML basics, CSS Diner, Selector Gadget, Basic file management
ca. 12:30–13:30   Lunch Break
                  Static Web Scraping, APIs and Data Formats
                  Continue with exercises and project work
                  Applied exercise block, Practical scraping (static sites, APIs)
                  Wrap-up
17:00             End

Day 2: Advanced Techniques and Best Practices

Time              Topic
9:00              Start
                  Day 1 recap + roadmap for today
                  Dynamic content: rvest::read_html_live() (chromote) demo (clicking / “Mehr laden”, i.e. “load more”)
                  DevTools/Network: discover backend requests (“hidden APIs”)
                  Reproduce requests with httr (GET/POST, headers, cookies)
                  What can go wrong? (failure modes) + best practices (logging, caching, retries, rate limits)
ca. 12:30–13:30   Lunch Break
                  Ethics & legal (ToS, robots.txt, polite, attribution/licensing)
                  Agents & tooling: skills + MCP / DevTools integration (how to supervise/verify)
                  Wrap-up + Q&A
17:00             End
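The Day 2 block on failure modes and best practices (logging, retries) can be illustrated with a small retry helper. This is a minimal sketch in standard-library Python, not the course's own R code; in R, `httr::RETRY()` serves a similar purpose. The function name `fetch_with_retries`, its parameters, and the logger name `scraper` are hypothetical, and the `fetch` argument stands in for whatever function performs the actual request.

```python
import logging
import time

logger = logging.getLogger("scraper")  # hypothetical logger name


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failures with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Assumption: every exception is treated as transient. A real
            # scraper would retry only on timeouts, 429 or 5xx responses.
            if attempt == max_attempts:
                logger.error("giving up on %s after %d attempts: %s",
                             url, attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d for %s failed (%s); retrying in %.1fs",
                           attempt, url, exc, delay)
            sleep(delay)
```

Logging each failed attempt, rather than silently retrying, makes it possible to see afterwards which pages were unstable and whether the collected data may have gaps.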

Resources

Example Web Scraping Projects

Easy

Medium

Difficult

Research Applications

  • Siegel, Alexandra A., and Vivienne Badaan. “#No2Sectarianism: Experimental Approaches to Reducing Sectarian Hate Speech Online.” American Political Science Review 114, no. 3 (2020): 837–55. https://doi.org/10.1017/S0003055420000283
  • Mitts, Tamar. “Banned: How Deplatforming Extremists Mobilizes Hate in the Dark Corners of the Internet.” Working Paper (2021). https://www.dropbox.com/s/iatnxn5gtq48fxu/Mitts_banned.pdf?dl=0
  • Elshehawy, Ashrakat, Arun Frey, Violeta I. Haas, Sascha Riaz, and Tobias Roemer. “The Police as Gatekeepers of Information: Immigration Salience and Selective Crime Reporting.” SocArXiv preprint (2025). https://osf.io/preprints/socarxiv/trhys_v1
  • Boas, Taylor C., and F. Daniel Hidalgo. “Controlling the Airwaves: Incumbency Advantage and Community Radio in Brazil.” American Journal of Political Science 55, no. 4 (2011): 869–85. https://doi.org/10.1111/j.1540-5907.2011.00532.x
  • Bischof, Daniel, and Thomas Kurer. “Place-Based Campaigning: The Political Impact of Real Grassroots Mobilization.” The Journal of Politics (2023). https://doi.org/10.1086/723985
  • Box-Steffensmeier, Janet M., et al. “I Get By with a Little Help from My Friends: Leveraging Campaign Resources to Maximize Congressional Power.” American Journal of Political Science 64, no. 4 (2020): 1017–33. https://doi.org/10.1111/ajps.12528
  • Motolinia, Lucia. “Electoral Accountability and Particularistic Legislation: Evidence from an Electoral Reform in Mexico.” American Political Science Review 115, no. 1 (2021): 97–113. https://doi.org/10.1017/S0003055420000672
  • Sances, Michael W. “Defund My Police? The Effect of George Floyd’s Murder on Support for Local Police Budgets.” The Journal of Politics (2023). https://doi.org/10.1086/723979
  • Lutscher, Philipp M. “Hot Topics: Denial-of-Service Attacks on News Websites in Autocracies.” Political Science Research and Methods (2021): 1–16. https://doi.org/10.1017/psrm.2021.68
  • Erfort, Cornelius, Heike Klüver, and Lukas F. Stötzer. “The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases.” Research & Politics (2023).
  • Dickson, Zachary P., Sara B. Hobolt, Catherine E. De Vries, and Simone Cremaschi. “Public Service Decline and Support for the Populist Right: Evidence from England’s National Health Service” (accepted at American Political Science Review). http://catherinedevries.eu/NHS.pdf
  • Morris, Kevin. “Turnout and Amendment Four: Mobilizing Eligible Voters Close to Formerly Incarcerated Floridians.” American Political Science Review 115, no. 3 (2021): 805–20. https://doi.org/10.1017/S0003055421000253
  • Gessler, Theresa, and Sophia Hunger. “How the Refugee Crisis and Radical Right Parties Shape Party Competition on Immigration.” Political Science Research and Methods 10, no. 3 (2022): 524–44. https://doi.org/10.1017/psrm.2021.64
  • Stukal, Denis, et al. “Why Botter: How Pro-Government Bots Fight Opposition in Russia.” American Political Science Review 116, no. 3 (2022): 843–57. https://doi.org/10.1017/S0003055421001507

Contact Information

For questions or concerns, please contact: cornelius.erfort@uni-wh.de