Pulling Metadata from MorphoSource
MorphoSource is a digital repository for 3D media, where researchers and enthusiasts can find and share high-resolution scans of natural history specimens. In this blog post, we’ll explore how a Python script can:
- Scrape MorphoSource to find newly uploaded records.
- Extract key metadata such as Object, Taxonomy, Element or Part, Data Manager, Date Uploaded, Publication Status, Rights Statement, and CC License.
- Output that metadata back to GitHub Actions for further use in your CI/CD workflow.
1. Overview of the Scraper
The Python script (which we’ll call morphosource_scraper.py) uses:
- Requests to fetch the search results page for MorphoSource’s X-ray Computed Tomography (CT) data.
- BeautifulSoup to parse the HTML structure.
- GitHub Actions environment variables to write or store relevant information for subsequent steps in a workflow.
Below is the key function that pulls the top records from MorphoSource:
def parse_top_records(n=3):
"""
Grabs the first n <li class="document blacklight-media"> from the search results
(descending by creation date). Returns a list of dicts containing relevant metadata.
"""
session = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
resp = session.get(SEARCH_URL, headers=headers, timeout=30)
resp.raise_for_status()
# Check for server error (custom function to handle 500 errors)
check_for_server_error(resp.text)
soup = BeautifulSoup(resp.text, "html.parser")
# Select the first `n` results from the page
li_list = soup.select("div#search-results li.document.blacklight-media")[:n]
records = []
for li in li_list:
record = {}
# 1) Title & detail link
title_el = li.select_one("h3.search-result-title a")
if title_el:
record["title"] = title_el.get_text(strip=True)
record["detail_url"] = BASE_URL + title_el.get("href", "")
else:
record["title"] = "No Title"
record["detail_url"] = None
# 2) Additional metadata from dt/dd pairs
metadata_dl = li.select_one("div.metadata dl.dl-horizontal")
if metadata_dl:
items = metadata_dl.select("div.index-field-item")
for item in items:
dt = item.select_one("dt")
dd = item.select_one("dd")
if dt and dd:
field_name = dt.get_text(strip=True).rstrip(":")
field_value = dd.get_text(strip=True)
record[field_name] = field_value
records.append(record)
return records
How It Works
1. Scraping X-ray CT Data from MorphoSource
Steps:
-
GET Request:
The script performs a GET request toSEARCH_URL, where MorphoSource lists X-ray CT data. -
Parse HTML:
BeautifulSoupprocesses the HTML and locates each<li class="document blacklight-media">. -
Extract Title:
From the<h3>element, the script grabs the record’s title and a URL that leads to its detail page. -
Extract Metadata:
The scraper finds<dl class="dl-horizontal">blocks containing<dt>(field name) and<dd>(field value). -
Store Data in a Dictionary:
Each record’s metadata is added as key-value pairs (e.g.,"Object": "Some value").
Fields of Interest:
- Object
- Taxonomy
- Element or Part
- Data Manager
- Date Uploaded
- Publication Status
- Rights Statement
- CC License
By matching the <dt> text to these field names, the script reliably maps each piece of metadata to a dictionary key.
2. Writing Metadata to GitHub Output
Once the script has gathered the records, it constructs a formatted message and writes it to a GitHub Actions output file. This allows subsequent steps in your workflow to consume the data (e.g., for creating a release, generating a report, or sending a notification).
Relevant Code Snippet:
def write_github_output(is_new_data, message):
"""Helper function to write GitHub output"""
github_output = os.environ.get("GITHUB_OUTPUT")
if github_output:
with open(github_output, "a") as fh:
fh.write(f"new_data={str(is_new_data).lower()}\n")
fh.write("details<<EOF\n")
fh.write(message + "\n")
fh.write("EOF\n")
Key Steps
-
Determine if New Data Exists:
The script checks whether there are newly added records by comparing the current record count to a previously stored count. -
Prepare a Message:
It loops over the extracted records, appending lines for each metadata field found (e.g.,Object,Taxonomy, etc.). -
Write to
GITHUB_OUTPUT:
GitHub Actions uses this special environment variable to pass data between steps. The script writes two pieces of output:new_data: A boolean indicating if there are new records.details: A multi-line string (EOF block) containing the extracted metadata.
Example Usage in a GitHub Actions Workflow:
- name: Use scraper output run: | echo "New Data: $" echo "Details: $"
3. Why This Matters
Pulling metadata is invaluable for:
-
Research Tracking:
Keep tabs on newly uploaded scans relevant to your field of study. -
Automated Releases:
Automatically generate changelogs or summary releases that include details about newly posted specimens. -
Notifications & Alerts:
Integrate with Slack or email workflows to alert your team when certain taxa or specimen types appear on MorphoSource.
By having these fields (Object, Taxonomy, etc.) automatically harvested and ready to share, you can save time and ensure consistency in your data reporting pipeline.
4. Conclusion
Pulling metadata from MorphoSource can be done seamlessly with a Python script leveraging Requests and BeautifulSoup. After the script extracts key fields like Object, Taxonomy, Element or Part, and more, it writes them to GitHub Actions outputs. This powerful combination allows your data to be efficiently processed and integrated into your broader CI/CD workflows.
Whether you’re a researcher, educator, or developer, having an automated system to gather and broadcast morphological data can streamline your projects and keep everyone up-to-date. Feel free to expand this script to fetch additional fields or integrate with other services, and take advantage of GitHub Actions for an all-in-one automation solution.
Happy scraping!
← Previous Post $~~~~~~~~~~~$ Next Post →