Explaining the CT to Text Python Code & OpenAI Prompt

In this post, we’ll look under the hood of our CT to Text Python script, which processes MorphoSource release notes, extracts important record information, and leverages an OpenAI-powered API to generate readable text summaries. While we won’t reprint the entire codebase, we’ll focus on its major functions, showcasing how each piece contributes to transforming raw release data into a succinct, scientifically informed narrative.

1. Setting Up the Environment

At the start, we import necessary libraries, including re for regular expressions, sys and os for file and environment handling, and an openai-like class. We also fetch our OPENAI_API_KEY from environment variables. If it’s missing, the script gracefully exits.

from openai import OpenAI

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")

This ensures the script has everything it needs—particularly the API key—to interact with the OpenAI (or o1-mini) model.

2. Parsing Records with parse_records_from_body()

Purpose

Approach

A regular expression (RE_RECORD_HEADER) identifies when a new record begins. Each time a new record is found, we create a dictionary to store information. Then, for each key-value pair line (e.g., Taxonomy: Homo sapiens), we add the data to that record’s dictionary.

Key parts include:

When parsing is done, it returns a list of record dictionaries such as:

[
  {
    "record_number": "104236",
    "title": "Endocast [Mesh] [CT]",
    "detail_url": "...",
    "Object": "...",
    "Taxonomy": "...",
    ...
  },
  ...
]

3. Generating Text with generate_text_for_records()

Once we have a list of records, we need to turn them into a multi-paragraph summary. That’s where generate_text_for_records() comes in.

Prompt Construction

We assemble a prompt (called user_content in the script) by looping through each record’s data. For each record, we add lines like:

Record #104236:
 - Title: Endocast [Mesh] [CT]
 - URL: ...
 - Object: ...
 - Taxonomy: ...

We then append instructions telling the AI model how to write the summary:

The OpenAI (or o1-mini) API Call

Inside a try-except block:

4. The main() Function

The script’s entry point (main()) handles:

By printing the result, this script easily integrates with GitHub Actions—allowing the action to capture and use that text for updating the release notes.

5. Putting It All Together

Workflow:

  1. GitHub Actions triggers the script after detecting a new “morphosource-updates” release
  2. The raw body text is sent to ct_to_text.py, which parses record fields into more structured data
  3. The script calls OpenAI (or o1-mini) to generate a multi-paragraph summary, highlighting key anatomical or morphological insights
  4. The result is returned to the GitHub Actions workflow, which then appends this AI-generated description to a new release on the repository

Key Takeaways

Conclusion

The CT to Text Python script demonstrates a neat process for turning dense release data into a set of clear, scientifically oriented descriptions. Through Regex parsing, structured keys, and a well-crafted prompt for the OpenAI model, the code seamlessly automates the transformation from raw data to polished summaries—perfect for adding that final layer of clarity to your MorphoSource updates.


← Previous Post $~~~~~~~~~~~$ Next Post →