Creating a Release When New MorphoSource Data Is Found

MorphoSource is a repository for 3D specimens, used extensively by researchers for sharing and discovering new digital morphology data. Automating the discovery of new data and then publishing this information in a convenient GitHub release can be a game-changer, particularly for research teams and open-source projects.

In this blog post, we’ll look at:

  1. How to scrape MorphoSource for new records.
  2. When new data is detected, creating a GitHub release with the metadata.
  3. Why this matters for scientific transparency and collaboration.

1. Overview

The idea is simple:

  1. Periodically run a workflow in GitHub Actions.
  2. Scrape MorphoSource for new records (e.g., additional 3D scans or updated data entries).
  3. If new entries exist, generate a GitHub release that includes the details of what’s changed.

Key Benefits


2. Sample Workflow File

Below is an example GitHub Actions file (.github/workflows/morphosource_release.yml) that demonstrates the concept:

name: MorphoSource Release

on:
  schedule:
    - cron: "0 0 * * *"  # Runs daily at midnight
  workflow_dispatch:

permissions:
  contents: write

jobs:
  check_for_new_data:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      
      - name: Install dependencies
        run: |
          pip install requests beautifulsoup4
      
      - name: Run Scraper
        id: scraper
        run: python .github/scripts/scrape_morphosource.py
      
      - name: Commit updated last_count.txt
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add .github/last_count.txt
          git commit -m "Update last_count.txt for new records"
          git push
      
      - name: Generate Timestamp
        id: ts
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          TS=$(date +'%Y-%m-%d_%H-%M-%S')
          echo "timestamp=$TS" >> $GITHUB_OUTPUT
      
      - name: Create Release
        if: steps.scraper.outputs.new_data == 'true'
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: $
        with:
          tag_name: morphosource-update-$
          release_name: "MorphoSource Update $"
          body: $
          draft: false
          prerelease: false

Explanation

  1. on.schedule:
    • Runs the workflow once a day at midnight (0 0 * * *).
  2. Run Scraper:
    • Executes the Python script that checks MorphoSource for new data.
  3. Commit Updated last_count.txt:
    • This step only runs if the scraper found new records, preventing unnecessary commits.
  4. Generate Timestamp:
    • Captures the current date/time, which is used to create unique tags and release names.
  5. Create Release:
    • Publishes a release titled MorphoSource Update [timestamp] and includes details of newly found records.

3. The Python Scraper

A typical .github/scripts/scrape_morphosource.py script might:

  1. Track the old record count in .github/last_count.txt.
  2. Fetch the current record count by scraping MorphoSource’s X-ray Computed Tomography listing.
  3. Compare the old and new counts to determine if there are new records.
  4. If new items are found:
    • Compile a formatted list of the latest records.
    • Set outputs for GitHub Actions (new_data and details).

Example Script:

#!/usr/bin/env python3
import os
import requests
from bs4 import BeautifulSoup

LAST_COUNT_FILE = ".github/last_count.txt"
SEARCH_URL = "https://www.morphosource.org/catalog/media?q=X-Ray+Computed+Tomography"

def load_last_count():
    """Load the previous record count from last_count.txt."""
    if os.path.exists(LAST_COUNT_FILE):
        with open(LAST_COUNT_FILE, "r") as f:
            return int(f.read().strip())
    return 0

def save_last_count(count):
    """Save the new record count to last_count.txt."""
    with open(LAST_COUNT_FILE, "w") as f:
        f.write(str(count))

def main():
    old_count = load_last_count()
    
    # 1) Fetch data
    response = requests.get(SEARCH_URL, timeout=30)
    response.raise_for_status()
    
    # 2) Parse HTML
    soup = BeautifulSoup(response.text, "html.parser")
    results = soup.select("li.document.blacklight-media")
    new_count = len(results)
    
    # 3) Compare old_count and new_count
    if new_count > old_count:
        # 4) Build a details message of new items (just a placeholder)
        # In reality, you'd parse each new item’s metadata (title, taxonomy, etc.)
        diff = new_count - old_count
        details_msg = f"Found {diff} new record(s)!\n"

        # Output to GitHub
        print(f"::set-output name=new_data::true")
        print(f"::set-output name=details::{details_msg}")

        # Update last_count.txt
        save_last_count(new_count)
    else:
        print("::set-output name=new_data::false")

if __name__ == "__main__":
    main()

Key Steps

  1. Compare Counts:
    • If new_count > old_count, then new records are available.
  2. Set Outputs:
    • new_data is set to true or false.
    • details is a multi-line string describing what’s new.
  3. Persist State:
    • last_count.txt is updated so that the next run knows how many records were last seen.

4. Why Automate Releases?

Using GitHub Releases for new data found on MorphoSource offers several advantages:

  1. Centralized Tracking:
    • Each release acts as a time-stamped checkpoint, capturing what changed at a specific moment.
  2. Notifications:
    • Contributors and watchers get automated notifications when a new release is published.
  3. Easy Sharing:
    • Releases can include formatted details and links to newly discovered records, making it simple to share updates with colleagues.
  4. Versioning:
    • Over time, you build a complete history of changes without manually curating each one.

5. Customizations

  1. Schedule:
    • Adjust the cron to run at a different interval (e.g., every hour instead of daily).
  2. Additional Data:
    • Enhance the scraper to retrieve more metadata fields, such as taxonomic names, date uploaded, or data manager.
  3. Release Strategy:
    • Mark releases as drafts if you want to review them before making them public.
    • Create a pre-release if you’re still testing.
  4. Notification Steps:
    • Send Slack or email alerts when a new release is created, keeping team members in the loop.

Conclusion

By combining GitHub Actions and a simple Python scraper, you can automatically detect when MorphoSource has new data and publish a GitHub release summarizing those findings. This approach streamlines collaboration, maintains a versioned record of updates, and ensures everyone has easy access to the latest discoveries.

Give it a try! Automate your MorphoSource data checks and enjoy the benefits of continuous, transparent updates.

Happy scraping!


← Previous Post $~~~~~~~~~~~$ Next Post →