Explaining the CT to Text Python Code & OpenAI Prompt

In this post, we’ll look under the hood of our CT to Text Python script, which processes MorphoSource release notes, extracts important record information, and leverages an OpenAI-powered API to generate readable text summaries. While we won’t reprint the entire codebase, we’ll focus on its major functions, showcasing how each piece contributes to transforming raw release data into a succinct, scientifically informed narrative.

1. Setting Up the Environment

At the start, we import necessary libraries, including re for regular expressions, sys and os for file and environment handling, and an openai-like class. We also fetch our OPENAI_API_KEY from environment variables. If it’s missing, the script gracefully exits.

from openai import OpenAI

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")

This ensures the script has everything it needs—particularly the API key—to interact with the OpenAI (or o1-mini) model.

2. Parsing Records with parse_records_from_body()

Purpose

Scans the release body for text blocks identifying “New Record #XXXX Title: …”
Collects record details like Object, Taxonomy, Date Uploaded, etc.

Approach

A regular expression (RE_RECORD_HEADER) identifies when a new record begins. Each time a new record is found, we create a dictionary to store information. Then, for each key-value pair line (e.g., Taxonomy: Homo sapiens), we add the data to that record’s dictionary.

Key parts include:

Regex that matches lines like New Record #104236 Title: Endocast [Mesh] [CT]
Splitting lines on “:” to gather additional details
Storing recognized fields (like detail_url, Object, Taxonomy) into canonical keys to keep data consistent

When parsing is done, it returns a list of record dictionaries such as:

[
  {
    "record_number": "104236",
    "title": "Endocast [Mesh] [CT]",
    "detail_url": "...",
    "Object": "...",
    "Taxonomy": "...",
    ...
  },
  ...
]

3. Generating Text with generate_text_for_records()

Once we have a list of records, we need to turn them into a multi-paragraph summary. That’s where generate_text_for_records() comes in.

Prompt Construction

We assemble a prompt (called user_content in the script) by looping through each record’s data. For each record, we add lines like:

Record #104236:
 - Title: Endocast [Mesh] [CT]
 - URL: ...
 - Object: ...
 - Taxonomy: ...

We then append instructions telling the AI model how to write the summary:

Tone: Scientifically informed but broadly readable
Length: ~200 words per record
Focus: Species (taxonomy), object details, and notable anatomical or morphological features

The OpenAI (or o1-mini) API Call

Inside a try-except block:

The code sends our constructed messages to a chat completion endpoint
The model returns a summary text for each record
We strip out any extra whitespace or error-check if the call fails
If no records exist or the OPENAI_API_KEY is missing, the function returns an appropriate message instead

4. The main() Function

The script’s entry point (main()) handles:

Reading the release body file (passed in via command line argument)
Calling parse_records_from_body(body) to gather structured record data
Passing that list of records to generate_text_for_records(records) for the final AI-composed text
Printing the generated text to standard output

By printing the result, this script easily integrates with GitHub Actions—allowing the action to capture and use that text for updating the release notes.

5. Putting It All Together

Workflow:

GitHub Actions triggers the script after detecting a new “morphosource-updates” release
The raw body text is sent to ct_to_text.py, which parses record fields into more structured data
The script calls OpenAI (or o1-mini) to generate a multi-paragraph summary, highlighting key anatomical or morphological insights
The result is returned to the GitHub Actions workflow, which then appends this AI-generated description to a new release on the repository

Key Takeaways

Modular Design: Splitting tasks into smaller functions (parse_records_from_body, generate_text_for_records) keeps the code logical and testable
Regex & Key-Value Parsing: This combination efficiently transforms raw, unstructured text into a structured list of records
OpenAI Prompt Strategy: By carefully building and structuring the prompt, we ensure the AI is guided to produce the desired scientific yet accessible summaries

Conclusion

The CT to Text Python script demonstrates a neat process for turning dense release data into a set of clear, scientifically oriented descriptions. Through Regex parsing, structured keys, and a well-crafted prompt for the OpenAI model, the code seamlessly automates the transformation from raw data to polished summaries—perfect for adding that final layer of clarity to your MorphoSource updates.

← Previous Post $~~~~~~~~~~~$ Next Post →