Challenges Developing CT to Text in Production
Bringing our CT to Text pipeline from proof-of-concept to a production environment has been both exciting and challenging. From architectural considerations to preventing infinite loops, here’s a high-level look at the hurdles we faced and how we overcame them.
1. Evolving to a Microservices Architecture
Initially, CT to Text started as a single, monolithic script. As the features expanded, we discovered several limitations:
- Scalability: A single process handling both data parsing and AI-driven text generation became slow and difficult to scale
- Maintainability: Continually adding new features or altering existing ones created a fragile codebase that was more prone to errors
Solution
We split core functionalities—data retrieval, record parsing, AI summarization, release creation—into microservices. Each microservice can now be developed, tested, and deployed independently:
- Data Fetching Service: Consumes APIs or local databases for the latest CT release data
- Parsing Service: Extracts critical metadata—like specimen taxonomy and morphological details
- Text Generation Service: Calls our AI model to transform structured data into human-readable text
- Release Publisher: Packages the generated text and updates it back into GitHub releases
This modular approach helps ensure that any bug or performance bottleneck can be pinpointed quickly without disrupting the entire pipeline.
2. Preventing Looping Through Releases
One of the trickiest issues we encountered was inadvertently re-triggering the CT to Text workflow for the same release data, leading to an infinite feedback loop.
The Looping Scenario
- Our Parse MorphoSource Data workflow runs and publishes a new release tagged
morphosource-updates-1234 - The CT to Text workflow sees the new tag, generates text, and publishes a new release (e.g.,
ct_to_text_analysis-<timestamp>) - An unguarded workflow could, in turn, re-detect that new
ct_to_text_analysis-<timestamp>release as fresh data and re-run, resulting in an endless chain of AI summaries
The Fix
Storing State: We implemented a check to recognize prior releases by type or tag pattern. Whenever a new release is detected, the pipeline ensures:
- The release is truly a
morphosource-updatesrelease, rather than an auto-generatedct_to_text_analysisrelease - The release tag hasn’t been processed before (e.g., storing the tag name in a small in-memory or persistent database, or via GitHub’s release history checks)
Tag Conventions: We adopted a strict naming pattern:
morphosource-updates-<something>: The release data the pipeline should processct_to_text_analysis-<timestamp>: The auto-generated release
If the pipeline sees a tag that does not match morphosource-updates-<...>, it skips it. This prevents the workflow from repeatedly acting on its own outputs.
3. Managing API Rate Limits
When working with GitHub for fetching releases and with our AI model’s API, hitting rate limits can stall or disrupt operations:
- GitHub Rate Limits: Excessive checks for releases can trigger secondary rate-limiting measures
- AI Model Rate Limits: Overzealous text generation calls in quick succession can lead to 429 “Too Many Requests” errors
Solution
We implemented:
- Caching: Only call the AI API once per record. If a record was already generated, we skip calling the AI again
- Scheduled Runs: Instead of polling continuously, we rely on specific triggers (e.g., a “workflow_run” event) or a limited schedule (e.g., once every few hours) to avoid accidental throttling
4. Observability & Logging
With so many moving parts in microservices, robust logging and observability are paramount:
- Centralized Logging: Each microservice logs to a unified dashboard, making it easy to trace requests from the data ingestion step through to final summary creation
- Telemetry: Metrics such as average response time from the AI model, number of records processed, and error rates help us identify bottlenecks in real time
5. Continuous Testing and CI/CD
Automated checks for each microservice ensure that new features or bug fixes don’t break existing functionality. Our CI pipeline runs:
- Unit Tests: Test each microservice’s core logic, like record parsing or text generation stubs
- Integration Tests: Emulate full workflows, from receiving a “morphosource-updates” release to publishing a new GitHub release
- End-to-End Tests: Confirm the final published release matches expected formatting and contains valid text for each record
Benefit: This approach makes releases more reliable and prevents regressions that could lead to incorrect summaries or release tags.
Final Thoughts
Despite the complexity, adopting a microservices approach, incorporating robust checks for previous releases, and implementing strict tag naming have given us a stable, scalable CT to Text pipeline. We now have a system that can handle expansions to new data sources and text-generation improvements without the risk of infinite loops or an overwhelming monolith.
As we continue refining CT to Text, we’re confident this foundation will serve both current and future needs, ensuring each new morphosource-updates release can be rapidly converted into user-friendly insights—without the system ever chasing its own tail.
← Previous Post $~~~~~~~~~~~$ Next Post →