Creating a Release When New MorphoSource Data Is Found
MorphoSource is a repository for 3D specimen data, used extensively by researchers for sharing and discovering digital morphology data. Automating the discovery of new records and publishing them as a GitHub release gives research teams and open-source projects a running, shareable log of what has been added.
In this blog post, we’ll look at:
- How to scrape MorphoSource for new records.
- How to create a GitHub release with the metadata when new data is detected.
- Why this matters for scientific transparency and collaboration.
1. Overview
The idea is simple:
- Periodically run a workflow in GitHub Actions.
- Scrape MorphoSource for new records (e.g., additional 3D scans or updated data entries).
- If new entries exist, generate a GitHub release that includes the details of what’s changed.
Key Benefits
- Centralized Log: Each release acts as a versioned snapshot of newly discovered data.
- Team Collaboration: Colleagues can “watch” the repository and get notified when new releases appear.
- Automated Documentation: Over time, you’ll have a chronological history of newly added MorphoSource specimens.
2. Sample Workflow File
Below is an example GitHub Actions file (.github/workflows/morphosource_release.yml) that demonstrates the concept:
name: MorphoSource Release
on:
  schedule:
    - cron: "0 0 * * *"  # Runs daily at midnight UTC
  workflow_dispatch:
permissions:
  contents: write
jobs:
  check_for_new_data:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install dependencies
        run: |
          pip install requests beautifulsoup4
      - name: Run Scraper
        id: scraper
        run: python .github/scripts/scrape_morphosource.py
      - name: Commit updated last_count.txt
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add .github/last_count.txt
          git commit -m "Update last_count.txt for new records"
          git push
      - name: Generate Timestamp
        id: ts
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          TS=$(date +'%Y-%m-%d_%H-%M-%S')
          echo "timestamp=$TS" >> $GITHUB_OUTPUT
      - name: Create Release
        if: steps.scraper.outputs.new_data == 'true'
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: morphosource-update-${{ steps.ts.outputs.timestamp }}
          release_name: "MorphoSource Update ${{ steps.ts.outputs.timestamp }}"
          body: ${{ steps.scraper.outputs.details }}
          draft: false
          prerelease: false
Explanation
- on.schedule:
- Runs the workflow once a day at midnight UTC (0 0 * * *).
- Run Scraper:
- Executes the Python script that checks MorphoSource for new data.
- Commit Updated last_count.txt:
- This step only runs if the scraper found new records, preventing unnecessary commits.
- Generate Timestamp:
- Captures the current date/time, which is used to create unique tags and release names.
- Create Release:
- Publishes a release titled MorphoSource Update [timestamp] and includes details of newly found records.
3. The Python Scraper
A typical .github/scripts/scrape_morphosource.py script might:
- Track the old record count in .github/last_count.txt.
- Fetch the current record count by scraping MorphoSource’s X-ray Computed Tomography listing.
- Compare the old and new counts to determine if there are new records.
- If new items are found:
- Compile a formatted list of the latest records.
- Set outputs for GitHub Actions (new_data and details).
Example Script:
#!/usr/bin/env python3
import os

import requests
from bs4 import BeautifulSoup

LAST_COUNT_FILE = ".github/last_count.txt"
SEARCH_URL = "https://www.morphosource.org/catalog/media?q=X-Ray+Computed+Tomography"


def load_last_count():
    """Load the previous record count from last_count.txt."""
    if os.path.exists(LAST_COUNT_FILE):
        with open(LAST_COUNT_FILE, "r") as f:
            return int(f.read().strip())
    return 0


def save_last_count(count):
    """Save the new record count to last_count.txt."""
    with open(LAST_COUNT_FILE, "w") as f:
        f.write(str(count))


def set_output(name, value):
    """Expose a step output to GitHub Actions.

    The older ::set-output command is deprecated; outputs are now appended
    to the file named by the GITHUB_OUTPUT environment variable.
    """
    path = os.environ.get("GITHUB_OUTPUT")
    if path is None:
        # Running outside GitHub Actions (e.g., locally): just print the value.
        print(f"{name}={value}")
        return
    with open(path, "a") as f:
        # The delimiter form keeps multi-line values intact.
        f.write(f"{name}<<EOF\n{value}\nEOF\n")


def main():
    old_count = load_last_count()

    # 1) Fetch data
    response = requests.get(SEARCH_URL, timeout=30)
    response.raise_for_status()

    # 2) Parse HTML
    soup = BeautifulSoup(response.text, "html.parser")
    results = soup.select("li.document.blacklight-media")
    # Note: this counts only the items rendered on the fetched page.
    new_count = len(results)

    # 3) Compare old_count and new_count
    if new_count > old_count:
        # 4) Build a details message of new items (just a placeholder)
        # In reality, you'd parse each new item's metadata (title, taxonomy, etc.)
        diff = new_count - old_count
        details_msg = f"Found {diff} new record(s)!"

        # Output to GitHub
        set_output("new_data", "true")
        set_output("details", details_msg)

        # Update last_count.txt
        save_last_count(new_count)
    else:
        set_output("new_data", "false")


if __name__ == "__main__":
    main()
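The placeholder details message in the script could be expanded along these lines. This is only a minimal sketch, assuming each new record has already been parsed into a dict with `title` and `url` keys. Those keys and the helper name `format_details` are hypothetical; they do not come from MorphoSource or from the script above.

```python
def format_details(records):
    """Build a Markdown release body from a list of record dicts.

    Each dict is assumed (hypothetically) to carry 'title' and 'url' keys.
    """
    lines = [f"Found {len(records)} new record(s)!", ""]
    for record in records:
        title = record.get("title", "Untitled record")
        url = record.get("url")
        # Link the title when a URL is available, otherwise list it plainly.
        lines.append(f"- [{title}]({url})" if url else f"- {title}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data for illustration, not real MorphoSource records.
    sample = [
        {"title": "Example skull scan", "url": "https://example.org/media/1"},
        {"title": "Example femur scan"},
    ]
    print(format_details(sample))
```

The returned string could then be passed to the step output that feeds the release body.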
Key Steps
- Compare Counts:
- If new_count > old_count, then new records are available.
- Set Outputs:
- new_data is set to true or false.
- details is a multi-line string describing what’s new.
- Persist State:
- last_count.txt is updated so that the next run knows how many records were last seen.
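Comparing counts shows that something changed, but not which records are new, and it misses the case where one record is removed while another is added. One alternative is to persist the set of record IDs already seen. This technique is swapped in here for illustration and is not part of the script above. The sketch assumes each record exposes a stable string ID; the file name `seen_ids.json` and the helper names are hypothetical.

```python
import json
import os
import tempfile


def load_seen_ids(path):
    """Return the set of record IDs saved by a previous run (empty if none)."""
    if not os.path.exists(path):
        return set()
    with open(path, "r") as f:
        return set(json.load(f))


def save_seen_ids(path, ids):
    """Persist IDs as a sorted JSON list so diffs in git stay small."""
    with open(path, "w") as f:
        json.dump(sorted(ids), f, indent=2)


def find_new_ids(current_ids, seen_ids):
    """Return IDs present now but absent from the previous run, in input order."""
    return [record_id for record_id in current_ids if record_id not in seen_ids]


if __name__ == "__main__":
    # A temporary directory stands in for the repository checkout.
    with tempfile.TemporaryDirectory() as workdir:
        state_file = os.path.join(workdir, "seen_ids.json")  # hypothetical name
        seen = load_seen_ids(state_file)
        current = ["000111", "000222", "000333"]  # placeholder IDs
        new_ids = find_new_ids(current, seen)
        print(f"{len(new_ids)} new record(s): {new_ids}")
        save_seen_ids(state_file, seen | set(current))
```

With this approach, the workflow would commit the ID file instead of last_count.txt, and the list returned by `find_new_ids` tells the release step exactly which records to describe.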
4. Why Automate Releases?
Using GitHub Releases for new data found on MorphoSource offers several advantages:
- Centralized Tracking:
- Each release acts as a time-stamped checkpoint, capturing what changed at a specific moment.
- Notifications:
- Contributors and watchers get automated notifications when a new release is published.
- Easy Sharing:
- Releases can include formatted details and links to newly discovered records, making it simple to share updates with colleagues.
- Versioning:
- Over time, you build a complete history of changes without manually curating each one.
5. Customizations
- Schedule:
- Adjust the cron expression to run at a different interval (e.g., every hour instead of daily).
- Additional Data:
- Enhance the scraper to retrieve more metadata fields, such as taxonomic names, date uploaded, or data manager.
- Release Strategy:
- Mark releases as drafts if you want to review them before making them public.
- Create a pre-release if you’re still testing.
- Notification Steps:
- Send Slack or email alerts when a new release is created, keeping team members in the loop.
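A notification step could be sketched in Python as well. The example below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The webhook URL would come from a repository secret; the environment variable name `SLACK_WEBHOOK_URL` and the helper names are hypothetical. Only the standard library is used.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, release_name, details):
    """Prepare (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f"New release: {release_name}\n{details}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url, release_name, details):
    """Send the notification and return the HTTP status code."""
    request = build_slack_request(webhook_url, release_name, details)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


if __name__ == "__main__":
    url = os.environ.get("SLACK_WEBHOOK_URL")  # hypothetical secret name
    if url:
        notify(url, "MorphoSource Update", "Found 2 new record(s)!")
    else:
        print("SLACK_WEBHOOK_URL is not set; skipping notification.")
```

Keeping request construction separate from sending makes the payload easy to inspect without any network access.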
Conclusion
By combining GitHub Actions and a simple Python scraper, you can automatically detect when MorphoSource has new data and publish a GitHub release summarizing those findings. This approach streamlines collaboration, maintains a versioned record of updates, and ensures everyone has easy access to the latest discoveries.
Give it a try! Automate your MorphoSource data checks and enjoy the benefits of continuous, transparent updates.
Happy scraping!