Parse MorphoSource Workflow Development

MorphoSource is a robust online repository of 3D specimen data, especially useful for researchers and museums looking to share and explore digital morphology. In this post, we’ll walk through a GitHub Actions workflow that periodically scrapes MorphoSource data, checks if new records exist, and automatically creates a release if they do.


1. Overview of the Workflow

Below is the YAML configuration for our GitHub Actions workflow, which lives in .github/workflows/parse_morphosource.yml. This workflow does the following:

  1. Schedules itself to run every 5 minutes (cron: "*/5 * * * *")
  2. Sets up Python and installs dependencies
  3. Runs a Python scraper to fetch data from MorphoSource
  4. Commits changes to last_count.txt if new data is found
  5. Creates or updates a GitHub Release with the newly discovered records

name: Parse MorphoSource Data

on:
  schedule:
    # Runs every 5 minutes (adjust as needed)
    - cron: "*/5 * * * *"
  workflow_dispatch:

permissions:
  contents: write

jobs:
  scrape_and_release:
    runs-on: ubuntu-latest

    steps:
      - name: Check out repository
        uses: actions/checkout@v3
        with:
          persist-credentials: true

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"

      - name: Install dependencies
        run: pip install requests beautifulsoup4

      - name: Run Scraper
        id: scraper
        run: python .github/scripts/scrape_morphosource.py

      - name: Commit updated last_count.txt
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add .github/last_count.txt
          git commit -m "Update last_count.txt for new records"
          git push

      - name: Generate Timestamp
        id: gen_ts
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          # Format: YYYY-MM-DD_HH-MM-SS
          TS=$(date +'%Y-%m-%d_%H-%M-%S')
          echo "timestamp=$TS" >> $GITHUB_OUTPUT

      - name: Create or Update Release
        if: steps.scraper.outputs.new_data == 'true'
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: morphosource-updates-${{ steps.gen_ts.outputs.timestamp }}
          release_name: "MorphoSource Updates #${{ github.run_number }}"
          body: ${{ steps.scraper.outputs.details }}
          draft: false
          prerelease: false

2. Understanding the Workflow Steps

on.schedule

The cron expression "*/5 * * * *" triggers the workflow every five minutes, and workflow_dispatch additionally lets you run it manually from the Actions tab.

Permissions

contents: write grants the job's GITHUB_TOKEN the right to push commits and create releases in this repository.

Checkout

actions/checkout@v3 clones the repository onto the runner; persist-credentials: true keeps the token configured so the later git push succeeds.

Set Up Python

actions/setup-python@v4 installs Python 3.9 on the runner.

Install Dependencies

Installs requests and beautifulsoup4, the two libraries the scraper relies on.

Run Scraper

Executes .github/scripts/scrape_morphosource.py. The step's id (scraper) lets later steps read its new_data output.

Commit Updated last_count.txt

Runs only when new_data == 'true'. It configures a bot identity, commits the refreshed counter file, and pushes it back to the repository.

Generate Timestamp

Builds a YYYY-MM-DD_HH-MM-SS timestamp and writes it to GITHUB_OUTPUT so the release step can reference it.

Create or Update Release

Publishes a GitHub Release, tagged with the timestamp, whenever new records were found.
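The scraper and timestamp steps both hand values to later steps through the GITHUB_OUTPUT file. You can simulate the mechanism outside of Actions with an ordinary file (here GITHUB_OUTPUT is just a temp path we create ourselves, not the real runner-provided file):

```shell
# Simulate GitHub Actions step outputs locally.
# On a real runner, GITHUB_OUTPUT is provided by Actions; here we fake it.
GITHUB_OUTPUT=$(mktemp)

# What the "Generate Timestamp" step does:
TS=$(date +'%Y-%m-%d_%H-%M-%S')
echo "timestamp=$TS" >> "$GITHUB_OUTPUT"

# A later step would see this line as steps.gen_ts.outputs.timestamp:
cat "$GITHUB_OUTPUT"
```

Each `key=value` line appended to the file becomes an output of the step that wrote it.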


3. The Role of .github/scripts/scrape_morphosource.py

The actual scraping logic resides in your custom Python script, scrape_morphosource.py. Typically, it would:

  1. Fetch a MorphoSource search or listing page with requests
  2. Parse the total record count out of the HTML with BeautifulSoup
  3. Compare that count against the value stored in .github/last_count.txt
  4. Update the file and report whether new records were found

By leveraging environment variables and the GITHUB_OUTPUT file, your Python script communicates with the workflow to indicate next steps.
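A hypothetical sketch of what such a script might look like. The search URL, the page markup the parser expects, and the details output name are all assumptions here; the real script may differ:

```python
"""Sketch of .github/scripts/scrape_morphosource.py (hypothetical)."""
import os
import re

import requests
from bs4 import BeautifulSoup

# Assumed listing page; the real script may query a specific search URL.
SEARCH_URL = "https://www.morphosource.org/catalog/media"
COUNT_FILE = ".github/last_count.txt"


def parse_record_count(html: str) -> int:
    """Pull the total record count out of a results page.

    Assumes the page shows a results header like "1 - 10 of 12,345";
    the actual MorphoSource markup may need a different selector.
    """
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    match = re.search(r"of\s+([\d,]+)", text)
    if not match:
        raise ValueError("record count not found on page")
    return int(match.group(1).replace(",", ""))


def main() -> None:
    current = parse_record_count(requests.get(SEARCH_URL, timeout=30).text)

    try:
        with open(COUNT_FILE) as f:
            previous = int(f.read().strip())
    except FileNotFoundError:
        previous = 0

    new_data = current > previous
    if new_data:
        with open(COUNT_FILE, "w") as f:
            f.write(str(current))

    # Report back to the workflow via the GITHUB_OUTPUT file.
    with open(os.environ["GITHUB_OUTPUT"], "a") as out:
        out.write(f"new_data={str(new_data).lower()}\n")
        if new_data:
            out.write(f"details=Record count went from {previous} to {current}\n")


# Only run when invoked inside GitHub Actions (GITHUB_OUTPUT is set there).
if __name__ == "__main__" and os.environ.get("GITHUB_OUTPUT"):
    main()
```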


4. Why This Matters

This type of automated data checking and release creation can be beneficial for:

  - Researchers who want to know as soon as new specimens are published
  - Museums and collections staff tracking the growth of digitized holdings
  - Anyone who needs a timestamped, versioned record of when a dataset changed
With GitHub Actions, the entire pipeline runs in the cloud, making it easy to schedule tasks, collaborate with others, and version-control your scraping workflow.


5. Customizing the Workflow

Feel free to adapt the workflow to your needs:

  - Relax the cron schedule: every 5 minutes is aggressive, and hourly or daily polling is gentler on MorphoSource's servers
  - Pin a different Python version or add more dependencies
  - Swap actions/create-release (which is no longer maintained) for an actively maintained release action
  - Point the scraper at a different search query or data source

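For instance, polling hourly instead of every five minutes and moving to a newer Python would touch just these pieces of the workflow (a sketch, showing only the changed fragments):

```yaml
on:
  schedule:
    # Top of every hour instead of every 5 minutes
    - cron: "0 * * * *"
  workflow_dispatch:

# ...

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.12"
```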

Conclusion

By combining Python web scraping with GitHub Actions, you can build a reliable, automated pipeline to monitor MorphoSource for new data. The workflow described above automatically detects new records, updates a counter file, and creates a GitHub Release—making it easy to track changes over time and share them with collaborators.

Happy scraping and automating!

