Explaining the CT to Text Python Code & OpenAI Prompt
In this post, we’ll look under the hood of our CT to Text Python script, which processes MorphoSource release notes, extracts important record information, and leverages an OpenAI-powered API to generate readable text summaries. While we won’t reprint the entire codebase, we’ll focus on its major functions, showcasing how each piece contributes to transforming raw release data into a succinct, scientifically informed narrative.
1. Setting Up the Environment
At the start, we import necessary libraries, including re for regular expressions, sys and os for file and environment handling, and an openai-like class. We also fetch our OPENAI_API_KEY from environment variables. If it’s missing, the script gracefully exits.
from openai import OpenAI
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
This ensures the script has everything it needs—particularly the API key—to interact with the OpenAI (or o1-mini) model.
2. Parsing Records with parse_records_from_body()
Purpose
- Scans the release body for text blocks identifying “New Record #XXXX Title: …”
- Collects record details like Object, Taxonomy, Date Uploaded, etc.
Approach
A regular expression (RE_RECORD_HEADER) identifies when a new record begins. Each time a new record is found, we create a dictionary to store information. Then, for each key-value pair line (e.g., Taxonomy: Homo sapiens), we add the data to that record’s dictionary.
Key parts include:
- Regex that matches lines like
New Record #104236 Title: Endocast [Mesh] [CT] - Splitting lines on “:” to gather additional details
- Storing recognized fields (like detail_url, Object, Taxonomy) into canonical keys to keep data consistent
When parsing is done, it returns a list of record dictionaries such as:
[
{
"record_number": "104236",
"title": "Endocast [Mesh] [CT]",
"detail_url": "...",
"Object": "...",
"Taxonomy": "...",
...
},
...
]
3. Generating Text with generate_text_for_records()
Once we have a list of records, we need to turn them into a multi-paragraph summary. That’s where generate_text_for_records() comes in.
Prompt Construction
We assemble a prompt (called user_content in the script) by looping through each record’s data.
For each record, we add lines like:
Record #104236:
- Title: Endocast [Mesh] [CT]
- URL: ...
- Object: ...
- Taxonomy: ...
We then append instructions telling the AI model how to write the summary:
- Tone: Scientifically informed but broadly readable
- Length: ~200 words per record
- Focus: Species (taxonomy), object details, and notable anatomical or morphological features
The OpenAI (or o1-mini) API Call
Inside a try-except block:
- The code sends our constructed messages to a chat completion endpoint
- The model returns a summary text for each record
- We strip out any extra whitespace or error-check if the call fails
- If no records exist or the
OPENAI_API_KEYis missing, the function returns an appropriate message instead
4. The main() Function
The script’s entry point (main()) handles:
- Reading the release body file (passed in via command line argument)
- Calling
parse_records_from_body(body)to gather structured record data - Passing that list of records to
generate_text_for_records(records)for the final AI-composed text - Printing the generated text to standard output
By printing the result, this script easily integrates with GitHub Actions—allowing the action to capture and use that text for updating the release notes.
5. Putting It All Together
Workflow:
- GitHub Actions triggers the script after detecting a new “morphosource-updates” release
- The raw body text is sent to ct_to_text.py, which parses record fields into more structured data
- The script calls OpenAI (or o1-mini) to generate a multi-paragraph summary, highlighting key anatomical or morphological insights
- The result is returned to the GitHub Actions workflow, which then appends this AI-generated description to a new release on the repository
Key Takeaways
- Modular Design: Splitting tasks into smaller functions (
parse_records_from_body,generate_text_for_records) keeps the code logical and testable - Regex & Key-Value Parsing: This combination efficiently transforms raw, unstructured text into a structured list of records
- OpenAI Prompt Strategy: By carefully building and structuring the prompt, we ensure the AI is guided to produce the desired scientific yet accessible summaries
Conclusion
The CT to Text Python script demonstrates a neat process for turning dense release data into a set of clear, scientifically oriented descriptions. Through Regex parsing, structured keys, and a well-crafted prompt for the OpenAI model, the code seamlessly automates the transformation from raw data to polished summaries—perfect for adding that final layer of clarity to your MorphoSource updates.
← Previous Post $~~~~~~~~~~~$ Next Post →