Automated Data Scraping with GitHub Actions

Data Scraping without a Database

Dec 2020 Edit: You can see a live example of this in my own GitHub profile readme

Nov 2023 Edit: GitHub Actions contain a lot of footguns. Be aware of them all and move YAML complexity into code.

A common need I have in open source community work, especially with static site generators and the JAMstack, is scraping and updating data. For example, in the Svelte Community site we scrape the GitHub star count and last update, and ditto Gatsby Starters. Of course, you could grab data clientside, and for whatever you can’t do clientside, you can spin up a serverless function.

But sometimes it just makes sense to scrape data once instead of every time your users access your site, especially if that data requires tokens your users may not have. Typically you’d set up a cronjob and send the data into a database somewhere. With GitHub Actions, you can do this all inside GitHub, AND save a version controlled history of all data.

I noticed Mikeal Rogers doing exactly this for his Daily OSS watcher project, and so finally took some time to check out his code and make a minimal repro so others can take it as a base.

Demo

You can see my demo in action here: https://github.com/sw-yx/gh-action-data-scraping.

For those new to npm, there is a simple npm script defined in package.json. This is so you can manually run it while writing and testing your code. The action workflow calls this exact same script, so what runs on GitHub matches what you ran locally.

The Script

Straight to the point:

on:
  schedule:
    - cron:  '0 8 * * *' # 08:00 UTC daily (schedules run in UTC). Ref https://crontab.guru/examples.html
name: Scrape Data
jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    permissions:
      contents: write # needed if the repo's default GITHUB_TOKEN permission is read-only
    steps:
    - uses: actions/checkout@v4 # check out the repo this action is in, so that you have all prior data
    - name: Build
      run: npm install # any dependencies you may need
    - name: Scrape
      run: npm run action # actually run your npm script for scraping
      # env:
      #   WHATEVER_TOKEN: ${{ secrets.YOU_WANT }}
    - uses: mikeal/publish-to-github-action@master # commit and push whatever the scrape changed
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub sets this for you

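The basic idea, in English, is:

  • Every day at 08:00 UTC, check out the repo so that all prior data is available.
  • Install dependencies and run the npm script that scrapes the data and writes it into the repo.
  • Commit and push the changed files back to the repo with mikeal/publish-to-github-action.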

That’s it! Look ma, no database!

As part of your workflow, you can also fire off a static site build after this action completes, or weekly, or whenever else you like.

Limits

You can do whatever you like with this, including taking screenshots of sites!

The limits I can think of are the limits of GitHub and GitHub Actions:

In addition to these limits, GitHub Actions should not be used for:

  • Content or activity that is illegal or otherwise prohibited by their Terms of Service or Community Guidelines.
  • Cryptomining
  • Serverless computing
  • Activity that compromises GitHub users or GitHub services.
  • Any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used. In other words, be cool, don’t use GitHub Actions in ways you know you shouldn’t.

Be a good citizen, don’t abuse it and F this up for the rest of us!

More

I’m looking for more great use cases for GitHub Actions:

