Parse MorphoSource Workflow Development
MorphoSource is a robust online repository of 3D specimen data, especially useful for researchers and museums looking to share and explore digital morphology. In this post, we’ll walk through a GitHub Actions workflow that periodically scrapes MorphoSource data, checks if new records exist, and automatically creates a release if they do.
1. Overview of the Workflow
Below is the YAML configuration for our GitHub Actions workflow, which lives in `.github/workflows/parse_morphosource.yml`. This workflow does the following:

- Schedules itself to run every 5 minutes (`cron: "*/5 * * * *"`)
- Sets up Python and installs dependencies
- Runs a Python scraper to fetch data from MorphoSource
- Commits changes to `last_count.txt` if new data is found
- Creates or updates a GitHub Release with the newly discovered records
```yaml
name: Parse MorphoSource Data

on:
  schedule:
    # Runs every 5 minutes (adjust as needed)
    - cron: "*/5 * * * *"
  workflow_dispatch:

permissions:
  contents: write

jobs:
  scrape_and_release:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3
        with:
          persist-credentials: true

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"

      - name: Install dependencies
        run: pip install requests beautifulsoup4

      - name: Run Scraper
        id: scraper
        run: python .github/scripts/scrape_morphosource.py

      - name: Commit updated last_count.txt
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add .github/last_count.txt
          git commit -m "Update last_count.txt for new records"
          git push

      - name: Generate Timestamp
        id: gen_ts
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          # Format: YYYY-MM-DD_HH-MM-SS
          TS=$(date +'%Y-%m-%d_%H-%M-%S')
          echo "timestamp=$TS" >> $GITHUB_OUTPUT

      - name: Create or Update Release
        if: steps.scraper.outputs.new_data == 'true'
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.MY_GITHUB_TOKEN }}
        with:
          tag_name: morphosource-updates-${{ steps.gen_ts.outputs.timestamp }}
          release_name: "MorphoSource Updates #${{ steps.gen_ts.outputs.timestamp }}"
          body: ${{ steps.scraper.outputs.details }}
          draft: false
          prerelease: false
```
2. Understanding the Workflow Steps
on.schedule
- The workflow runs every 5 minutes (as defined by `cron: "*/5 * * * *"`). You can adjust this to a different interval (e.g., once an hour or once a day) depending on how frequently you'd like to check for new records.
Permissions
- `contents: write` allows us to push changes back to the repository (e.g., updating `last_count.txt`).
Checkout
- Uses `actions/checkout@v3` to pull the code from your repository so subsequent steps can access `.github/scripts/scrape_morphosource.py` and `last_count.txt`.
Set Up Python
- Uses `actions/setup-python@v4` with Python version `3.9`. Feel free to use a different Python version if needed.
Install Dependencies
- Installs `requests` and `beautifulsoup4`, two commonly used Python libraries for web scraping.
Run Scraper
- Executes `python .github/scripts/scrape_morphosource.py`. This script exposes two important outputs:
  - `new_data`: A boolean (`true` or `false`) indicating whether new MorphoSource records were found.
  - `details`: A formatted message describing the new records.
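Under the hood, GitHub Actions reads step outputs from the file pointed to by the `GITHUB_OUTPUT` environment variable. Here is a minimal sketch of how a script can publish these two outputs (the `set_output` helper name is just for illustration):

```python
import os

def set_output(name: str, value: str) -> None:
    """Append a name=value pair to the GITHUB_OUTPUT file so later workflow steps can read it."""
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")

# Example: report that new records were found, plus a human-readable summary.
set_output("new_data", "true")
set_output("details", "3 new records found")
```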
Commit Updated last_count.txt
- If `new_data == 'true'`, the workflow commits the updated `.github/last_count.txt`. This file tracks how many records we last saw, enabling the scraper to detect new records on the next run.
Generate Timestamp
- Creates an output variable (`timestamp`) based on the current date/time. This is used later to form unique tag and release names.
Create or Update Release
- If `new_data == 'true'`, this step uses `actions/create-release@v1` to automatically generate a new GitHub Release containing the details of the newly found records. `GITHUB_TOKEN` is injected from a repository secret (`MY_GITHUB_TOKEN`), which allows the action to create releases on your behalf.
3. The Role of .github/scripts/scrape_morphosource.py
The actual scraping logic resides in your custom Python script, `scrape_morphosource.py`. Typically, it would:
- Fetch data from MorphoSource (using `requests`).
- Parse the HTML (using `BeautifulSoup`).
- Determine whether there are new records since the last run.
- Write outputs (`new_data` and `details`) back to GitHub Actions so the workflow knows whether to create a new release.
By leveraging environment variables and the `GITHUB_OUTPUT` file, your Python script communicates with the workflow to indicate next steps.
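To make this concrete, here is a simplified, hypothetical sketch of what such a script could look like. The search URL, the "results" text pattern, and the parsing details are assumptions for illustration only; the real script will differ, but the overall flow is the same: fetch, parse, compare against `last_count.txt`, and write outputs (reusing the `set_output` helper shown earlier).

```python
import os
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Assumed values for illustration only -- adjust to the real search URL and page structure.
SEARCH_URL = "https://www.morphosource.org/catalog/media"
COUNT_FILE = Path(".github/last_count.txt")


def read_last_count() -> int:
    """Return the record count saved on the previous run (0 if the file is missing)."""
    return int(COUNT_FILE.read_text().strip()) if COUNT_FILE.exists() else 0


def fetch_current_count() -> int:
    """Fetch the search page and extract the total number of records (hypothetical parsing)."""
    resp = requests.get(SEARCH_URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical: assume the page text contains something like "1,234 results".
    match = re.search(r"([\d,]+)\s+results", soup.get_text())
    return int(match.group(1).replace(",", "")) if match else 0


def set_output(name: str, value: str) -> None:
    """Expose a value to later workflow steps via the GITHUB_OUTPUT file."""
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


def main() -> None:
    last_count = read_last_count()
    current_count = fetch_current_count()

    if current_count > last_count:
        # Persist the new count so the next run has an updated baseline.
        COUNT_FILE.write_text(f"{current_count}\n")
        set_output("new_data", "true")
        set_output("details", f"{current_count - last_count} new records (total: {current_count})")
    else:
        set_output("new_data", "false")
        set_output("details", "No new records found")


if __name__ == "__main__":
    main()
```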
4. Why This Matters
This type of automated data checking and release creation can be beneficial for:
- Scientific Data Monitoring: Regularly check for newly uploaded scans or specimen data.
- Automated Reports: Generate real-time notifications or summaries for your team or collaborators.
- Version Tracking: Use GitHub Releases to maintain a historical log of changes or new findings in the repository.
With GitHub Actions, the entire pipeline runs in the cloud, making it easy to schedule tasks, collaborate with others, and version-control your scraping workflow.
5. Customizing the Workflow
Feel free to adapt the workflow to your needs:
- Change the Schedule: Instead of every 5 minutes, run hourly, daily, or weekly.
- Add Notifications: After creating a release, you could add a step to send notifications via Slack, email, or any third-party integrations.
- Expand the Scraper: If more metadata fields or different search parameters are needed from MorphoSource, update `scrape_morphosource.py` accordingly.
Conclusion
By combining Python web scraping with GitHub Actions, you can build a reliable, automated pipeline to monitor MorphoSource for new data. The workflow described above automatically detects new records, updates a counter file, and creates a GitHub Release—making it easy to track changes over time and share them with collaborators.
Happy scraping and automating!