Parse MorphoSource Workflow Development
MorphoSource is a robust online repository of 3D specimen data, especially useful for researchers and museums looking to share and explore digital morphology. In this post, we’ll walk through a GitHub Actions workflow that periodically scrapes MorphoSource data, checks if new records exist, and automatically creates a release if they do.
1. Overview of the Workflow
Below is the YAML configuration for our GitHub Actions workflow, which lives in `.github/workflows/parse_morphosource.yml`. This workflow does the following:

- Schedules itself to run every 5 minutes (`cron: "*/5 * * * *"`)
- Sets up Python and installs dependencies
- Runs a Python scraper to fetch data from MorphoSource
- Commits changes to `last_count.txt` if new data is found
- Creates or updates a GitHub Release with the newly discovered records
```yaml
name: Parse MorphoSource Data

on:
  schedule:
    # Runs every 5 minutes (adjust as needed)
    - cron: "*/5 * * * *"
  workflow_dispatch:

permissions:
  contents: write

jobs:
  scrape_and_release:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3
        with:
          persist-credentials: true

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"

      - name: Install dependencies
        run: pip install requests beautifulsoup4

      - name: Run Scraper
        id: scraper
        run: python .github/scripts/scrape_morphosource.py

      - name: Commit updated last_count.txt
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add .github/last_count.txt
          git commit -m "Update last_count.txt for new records"
          git push

      - name: Generate Timestamp
        id: gen_ts
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          # Format: YYYY-MM-DD_HH-MM-SS
          TS=$(date +'%Y-%m-%d_%H-%M-%S')
          echo "timestamp=$TS" >> $GITHUB_OUTPUT

      - name: Create or Update Release
        if: steps.scraper.outputs.new_data == 'true'
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.MY_GITHUB_TOKEN }}
        with:
          tag_name: morphosource-updates-${{ steps.gen_ts.outputs.timestamp }}
          release_name: "MorphoSource Updates #${{ steps.gen_ts.outputs.timestamp }}"
          body: ${{ steps.scraper.outputs.details }}
          draft: false
          prerelease: false
```
2. Understanding the Workflow Steps
on.schedule
- The workflow runs every 5 minutes (as defined by `cron: "*/5 * * * *"`). You can adjust this to a different interval (e.g., once an hour or once a day) depending on how frequently you'd like to check for new records.
Permissions
- `contents: write` allows us to push changes back to the repository (e.g., updating `last_count.txt`).
Checkout
- Uses `actions/checkout@v3` to pull the code from your repository so subsequent steps can access `.github/scripts/scrape_morphosource.py` and `last_count.txt`.
Set Up Python
- Uses `actions/setup-python@v4` with Python version `3.9`. Feel free to use a different Python version if needed.
Install Dependencies
- Installs `requests` and `beautifulsoup4`, two commonly used Python libraries for web scraping.
Run Scraper
- Executes `python .github/scripts/scrape_morphosource.py`. This script exposes two important outputs:
  - `new_data`: A boolean (`true` or `false`) indicating whether new MorphoSource records were found.
  - `details`: A formatted message describing the new records.
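Under the hood, GitHub Actions reads step outputs from the file pointed to by the `GITHUB_OUTPUT` environment variable. Here is a minimal sketch of how a script can publish these two outputs (the `set_output` helper name is just for illustration):

```python
import os

def set_output(name: str, value: str) -> None:
    """Append a name=value pair to the GITHUB_OUTPUT file so later workflow steps can read it."""
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")

# Example: report that new records were found, plus a human-readable summary.
set_output("new_data", "true")
set_output("details", "3 new records found")
```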
Commit Updated last_count.txt
- If `new_data == 'true'`, the workflow commits the updated `.github/last_count.txt`. This file tracks how many records we last saw, enabling the scraper to detect new records on the next run.
Generate Timestamp
- Creates an output variable (`timestamp`) based on the current date/time. This is used later to form unique tag and release names.
Create or Update Release
- If `new_data == 'true'`, this step uses `actions/create-release@v1` to automatically generate a new GitHub Release containing the details of the newly found records. `GITHUB_TOKEN` is injected from a repository secret (`MY_GITHUB_TOKEN`), which allows the action to create releases on your behalf.
3. The Role of .github/scripts/scrape_morphosource.py
The actual scraping logic resides in your custom Python script, `scrape_morphosource.py`. Typically, it would:
- Fetch data from MorphoSource (using `requests`).
- Parse the HTML (using `BeautifulSoup`).
- Determine whether there are new records since the last run.
- Write outputs (`new_data` and `details`) back to GitHub Actions so the workflow knows whether to create a new release.
By leveraging environment variables and the `GITHUB_OUTPUT` file, your Python script communicates with the workflow to indicate next steps.
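To make this concrete, here is a simplified, hypothetical sketch of what such a script could look like. The search URL, the "results" text pattern, and the parsing details are assumptions for illustration only; the real script will differ, but the overall flow is the same: fetch, parse, compare against `last_count.txt`, and write outputs (reusing the `set_output` helper shown earlier).

```python
import os
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Assumed values for illustration only -- adjust to the real search URL and page structure.
SEARCH_URL = "https://www.morphosource.org/catalog/media"
COUNT_FILE = Path(".github/last_count.txt")


def read_last_count() -> int:
    """Return the record count saved on the previous run (0 if the file is missing)."""
    return int(COUNT_FILE.read_text().strip()) if COUNT_FILE.exists() else 0


def fetch_current_count() -> int:
    """Fetch the search page and extract the total number of records (hypothetical parsing)."""
    resp = requests.get(SEARCH_URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical: assume the page text contains something like "1,234 results".
    match = re.search(r"([\d,]+)\s+results", soup.get_text())
    return int(match.group(1).replace(",", "")) if match else 0


def set_output(name: str, value: str) -> None:
    """Expose a value to later workflow steps via the GITHUB_OUTPUT file."""
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


def main() -> None:
    last_count = read_last_count()
    current_count = fetch_current_count()

    if current_count > last_count:
        # Persist the new count so the next run has an updated baseline.
        COUNT_FILE.write_text(f"{current_count}\n")
        set_output("new_data", "true")
        set_output("details", f"{current_count - last_count} new records (total: {current_count})")
    else:
        set_output("new_data", "false")
        set_output("details", "No new records found")


if __name__ == "__main__":
    main()
```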
4. Why This Matters
This type of automated data checking and release creation can be beneficial for:
- Scientific Data Monitoring: Regularly check for newly uploaded scans or specimen data.
- Automated Reports: Generate real-time notifications or summaries for your team or collaborators.
- Version Tracking: Use GitHub Releases to maintain a historical log of changes or new findings in the repository.
With GitHub Actions, the entire pipeline runs in the cloud, making it easy to schedule tasks, collaborate with others, and version-control your scraping workflow.
5. Customizing the Workflow
Feel free to adapt the workflow to your needs:
- Change the Schedule: Instead of every 5 minutes, run hourly, daily, or weekly.
- Add Notifications: After creating a release, you could add a step to send notifications via Slack, email, or any third-party integrations.
- Expand the Scraper: If more metadata fields or different search parameters are needed from MorphoSource, update `scrape_morphosource.py` accordingly.
Conclusion
By combining Python web scraping with GitHub Actions, you can build a reliable, automated pipeline to monitor MorphoSource for new data. The workflow described above automatically detects new records, updates a counter file, and creates a GitHub Release—making it easy to track changes over time and share them with collaborators.
Happy scraping and automating!