Challenges Developing CT to Text in Production

Bringing our CT to Text pipeline from proof-of-concept to a production environment has been both exciting and challenging. From architectural considerations to preventing infinite loops, here’s a high-level look at the hurdles we faced and how we overcame them.

1. Evolving to a Microservices Architecture

Initially, CT to Text started as a single, monolithic script. As the features expanded, we discovered several limitations:

Scalability: A single process handling both data parsing and AI-driven text generation became slow and difficult to scale
Maintainability: Continually adding new features or altering existing ones created a fragile codebase that was more prone to errors

Solution

We split core functionalities—data retrieval, record parsing, AI summarization, release creation—into microservices. Each microservice can now be developed, tested, and deployed independently:

Data Fetching Service: Queries APIs or local databases for the latest CT release data
Parsing Service: Extracts critical metadata, such as specimen taxonomy and morphological details
Text Generation Service: Calls our AI model to transform structured data into human-readable text
Release Publisher: Packages the generated text and publishes it as a new GitHub release

This modular approach helps ensure that any bug or performance bottleneck can be pinpointed quickly without disrupting the entire pipeline.
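
To make the split concrete, here is a minimal sketch of the four stages as independent, swappable functions. It is an illustration under stated assumptions, not our actual service code: every function name, the Record fields, and the sample row are hypothetical, and each stage is an in-process stub standing in for a separately deployed service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    """Parsed metadata for one CT record (fields are illustrative)."""
    record_id: str
    taxonomy: str
    morphology: str


def fetch_release(tag: str) -> List[Dict[str, str]]:
    """Data Fetching stand-in: returns raw rows for a release tag (made-up data)."""
    return [{"id": "r1", "taxonomy": "example taxon", "morphology": "skull"}]


def parse_records(raw_rows: Iterable[Dict[str, str]]) -> List[Record]:
    """Parsing stand-in: turns raw rows into typed records."""
    return [Record(r["id"], r["taxonomy"], r["morphology"]) for r in raw_rows]


def generate_text(record: Record) -> str:
    """Text Generation stand-in: a real service would call the AI model here."""
    return f"{record.record_id}: CT scan of {record.taxonomy} ({record.morphology})."


def publish_release(source_tag: str, texts: List[str]) -> Dict[str, str]:
    """Release Publisher stand-in: returns the payload instead of calling GitHub."""
    return {"source_tag": source_tag, "body": "\n".join(texts)}


def run_pipeline(
    tag: str,
    fetch: Callable = fetch_release,
    parse: Callable = parse_records,
    generate: Callable = generate_text,
    publish: Callable = publish_release,
) -> Dict[str, str]:
    """Chains the stages; each one can be replaced or tested on its own."""
    records = parse(fetch(tag))
    texts = [generate(record) for record in records]
    return publish(tag, texts)
```

Because every stage is passed in as a parameter, a test can swap in a fake generate function without touching the fetching or publishing code, which is the same isolation the service boundaries give us in production.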

2. Preventing Looping Through Releases

One of the trickiest issues we encountered was inadvertently re-triggering the CT to Text workflow for the same release data, leading to an infinite feedback loop.

The Looping Scenario

Our Parse MorphoSource Data workflow runs and publishes a new release tagged morphosource-updates-1234
The CT to Text workflow sees the new tag, generates text, and publishes a new release (e.g., ct_to_text_analysis-<timestamp>)
An unguarded workflow could, in turn, re-detect that new ct_to_text_analysis-<timestamp> release as fresh data and re-run, resulting in an endless chain of AI summaries

The Fix

Storing State: We implemented a check that recognizes prior releases by type or tag pattern. Whenever a new release is detected, the pipeline verifies two things:

The release is truly a morphosource-updates release, rather than an auto-generated ct_to_text_analysis release
The release tag hasn’t been processed before (e.g., by recording the tag name in a small in-memory or persistent database, or by checking GitHub’s release history)

Tag Conventions: We adopted a strict naming pattern:

morphosource-updates-<something>: The release data the pipeline should process
ct_to_text_analysis-<timestamp>: The auto-generated release

If the pipeline sees a tag that does not match morphosource-updates-<...>, it skips it. This prevents the workflow from repeatedly acting on its own outputs.
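
The two guards described above can be sketched as follows. This is a minimal illustration, not the production check: the function names are hypothetical, the regular expression is one plausible reading of the tag convention, and a plain Python set stands in for whatever store actually records processed tags.

```python
import re
from typing import Callable, Set

# Assumed pattern: the fixed prefix followed by at least one more character.
SOURCE_TAG = re.compile(r"^morphosource-updates-.+$")


def should_process(tag: str, processed: Set[str]) -> bool:
    """Returns True only for unseen tags that match the source-data pattern."""
    if not SOURCE_TAG.match(tag):
        # Covers ct_to_text_analysis-* releases and any unrelated tag.
        return False
    return tag not in processed


def handle_release(tag: str, processed: Set[str], run: Callable[[str], object]) -> bool:
    """Runs the pipeline at most once per source tag; returns whether it ran."""
    if not should_process(tag, processed):
        return False
    run(tag)
    # Record the tag only after a successful run so failures can be retried.
    processed.add(tag)
    return True
```

Recording the tag after the run, rather than before, is a design choice in this sketch: a failed run leaves the tag unmarked, so the next trigger retries it.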

3. Managing API Rate Limits

Both the GitHub API, which we use to fetch releases, and our AI model’s API enforce rate limits, and hitting them can stall or disrupt operations:

GitHub Rate Limits: Excessive checks for releases can trigger secondary rate-limiting measures
AI Model Rate Limits: Overzealous text generation calls in quick succession can lead to 429 “Too Many Requests” errors

Solution

We implemented:

Caching: We call the AI API only once per record. If text has already been generated for a record, we skip calling the AI again
Scheduled Runs: Instead of polling continuously, we rely on specific triggers (e.g., a “workflow_run” event) or a limited schedule (e.g., once every few hours) to avoid accidental throttling
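
The caching idea can be sketched as a thin wrapper around the text-generation call. This is an illustration under assumptions: the SummaryCache class and its methods are hypothetical, the store is an in-memory dict rather than anything persistent, and keying by a hash of the record's content (so that an edited record is regenerated) is our choice for the example, not a description of the deployed cache.

```python
import hashlib
import json
from typing import Callable, Dict


class SummaryCache:
    """Calls the generator at most once per distinct record content."""

    def __init__(self, generate: Callable[[dict], str]) -> None:
        self._generate = generate          # stand-in for the AI API call
        self._store: Dict[str, str] = {}   # assumed in-memory store
        self.api_calls = 0                 # counter for telemetry or tests

    @staticmethod
    def _key(record: dict) -> str:
        # Sorting keys makes the hash independent of dict ordering.
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def summarize(self, record: dict) -> str:
        key = self._key(record)
        if key not in self._store:
            self.api_calls += 1
            self._store[key] = self._generate(record)
        return self._store[key]
```

With this wrapper, re-running the pipeline over a release whose records have not changed makes no additional API calls, which keeps repeated or retried runs well clear of the model's rate limit.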

4. Observability & Logging

With so many moving parts in microservices, robust logging and observability are paramount:

Centralized Logging: Each microservice logs to a unified dashboard, making it easy to trace requests from the data ingestion step through to final summary creation
Telemetry: Metrics such as average response time from the AI model, number of records processed, and error rates help us identify bottlenecks in real time
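
As a rough sketch of what this can look like in code, the example below emits JSON log lines that carry a shared run identifier and a step duration, using only Python's standard logging module. The field names (run_id, duration_ms), the helper names, and the choice of JSON are assumptions for illustration. The dashboard that would collect these lines is outside the scope of the sketch.

```python
import json
import logging
import time
from contextlib import contextmanager


class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Set through the `extra` argument; absent fields become null.
            "run_id": getattr(record, "run_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(entry)


def get_logger(service: str) -> logging.Logger:
    """Returns a logger named after the service, with JSON output attached once."""
    logger = logging.getLogger(service)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


@contextmanager
def timed(logger: logging.Logger, step: str, run_id: str):
    """Logs how long the wrapped step took, tagged with the shared run_id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
        logger.info(step, extra={"run_id": run_id, "duration_ms": elapsed_ms})
```

A service would wrap each unit of work, for example `with timed(get_logger("text-generation"), "generate", run_id): ...`, so that filtering on one run_id shows the same release moving through every stage.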

5. Continuous Testing and CI/CD

Automated checks for each microservice ensure that new features or bug fixes don’t break existing functionality. Our CI pipeline runs:

Unit Tests: Test each microservice’s core logic, such as record parsing, with the text generation call replaced by a stub
Integration Tests: Emulate full workflows, from receiving a “morphosource-updates” release to publishing a new GitHub release
End-to-End Tests: Confirm the final published release matches expected formatting and contains valid text for each record

Benefit: This approach makes releases more reliable and prevents regressions that could lead to incorrect summaries or release tags.
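
A unit test with a stubbed model might look like the sketch below. The summarize_records function and the ai_client.generate method are hypothetical names invented for the example. The point is only to show the AI call being replaced with a stub so the test runs without network access.

```python
import unittest
from unittest.mock import Mock


def summarize_records(records, ai_client):
    """Hypothetical unit under test: one generated summary per record id."""
    return {record["id"]: ai_client.generate(record) for record in records}


class SummarizeRecordsTest(unittest.TestCase):
    def test_calls_model_once_per_record(self):
        # The Mock stands in for the real AI client; no API call is made.
        ai_client = Mock()
        ai_client.generate.side_effect = lambda record: f"summary of {record['id']}"

        result = summarize_records([{"id": "a"}, {"id": "b"}], ai_client)

        self.assertEqual(result, {"a": "summary of a", "b": "summary of b"})
        self.assertEqual(ai_client.generate.call_count, 2)

    def test_empty_input_makes_no_calls(self):
        ai_client = Mock()
        self.assertEqual(summarize_records([], ai_client), {})
        ai_client.generate.assert_not_called()
```

Saved in a test module, this runs with `python -m unittest`. Asserting on the call count also guards the rate-limit behavior, since a regression that calls the model twice per record would fail the test.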

Final Thoughts

Despite the complexity, adopting a microservices approach, incorporating robust checks for previous releases, and implementing strict tag naming have given us a stable, scalable CT to Text pipeline. We now have a system that can handle expansions to new data sources and text-generation improvements without the risk of infinite loops or an overwhelming monolith.

As we continue refining CT to Text, we’re confident this foundation will serve both current and future needs, ensuring each new morphosource-updates release can be rapidly converted into user-friendly insights—without the system ever chasing its own tail.
