GitHub Actions to Update last_count.txt
A common challenge in automated data processing pipelines is tracking how much data you’ve processed since the last run. For example, when scraping a website or reading from an API, you may need to store the count of previously processed items. By storing that count in a file (like last_count.txt) and letting GitHub Actions update it, you can keep your pipeline running smoothly without duplicating work.
In this post, we’ll explore how to implement a GitHub Actions workflow that:
- Runs a Python script to check for new data.
- Updates
last_count.txtif new data is found. - Commits and pushes the changes back to your repository automatically.
1. Why Track a “Last Count”?
Let’s say you’re scraping a site for new records every day. If you don’t store the number of records you found last time, you have no easy way of knowing whether today’s run has discovered additional items. Keeping an integer in a simple text file provides a straightforward checkpoint—one your script can read and update as needed.
2. Sample Workflow File
Below is a simplified .github/workflows/update_last_count.yml that demonstrates the main idea:
name: Update Last Count
on:
schedule:
- cron: "0 * * * *" # runs hourly
workflow_dispatch:
permissions:
contents: write
jobs:
scrape_and_update:
runs-on: ubuntu-latest
steps:
- name: Check out the repository
uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.9"
- name: Install dependencies
run: |
pip install requests beautifulsoup4
- name: Run scraper
id: scraper
run: python .github/scripts/scrape_data.py
- name: Commit updated last_count.txt
if: steps.scraper.outputs.new_data == 'true'
run: |
git config user.name "github-actions"
git config user.email "actions@github.com"
git add .github/last_count.txt
git commit -m "Update last_count.txt for new records"
git push
Key Points
on.schedule:- The workflow runs every hour (as dictated by the cron expression).
- You can change the schedule as needed.
id: scraper:- Assigns an ID to the “Run scraper” step.
- This allows you to reference outputs from this step (e.g.,
new_data).
if: steps.scraper.outputs.new_data == 'true':- A condition that checks if the scraper found new data.
- If so, it proceeds with committing updates to
last_count.txt.
3. The Python Script
The script (.github/scripts/scrape_data.py) typically does the following:
- Load the old count from
last_count.txt. - Scrape or fetch the current count of records from your data source.
- Compare the old count to the new count.
- If new records are found, update
last_count.txtand outputnew_data=trueso the workflow knows to commit changes.
Simplified Example:
#!/usr/bin/env python3
import os
import sys
LAST_COUNT_FILE = ".github/last_count.txt"
def load_last_count():
"""Load the last recorded count from the file."""
if not os.path.exists(LAST_COUNT_FILE):
return 0
with open(LAST_COUNT_FILE, "r") as f:
return int(f.read().strip())
def save_last_count(count):
"""Save the new count to the file."""
with open(LAST_COUNT_FILE, "w") as f:
f.write(str(count))
def main():
"""Main logic to check for new data and update the count."""
old_count = load_last_count()
# Simulate fetching a "new count" from somewhere (e.g., scraping or an API call).
new_count = 100 # Just an example
if new_count > old_count:
# Write GitHub outputs for new_data and details (if needed)
print("::set-output name=new_data::true")
# Save the new count to last_count.txt
save_last_count(new_count)
else:
print("::set-output name=new_data::false")
if __name__ == "__main__":
main()
How It Works
load_last_count():- Returns the integer stored in
last_count.txt, defaulting to zero if the file doesn’t exist.
- Returns the integer stored in
new_count:- A placeholder for a real-world process (e.g., scraping a site or calling an API) to determine the current count of records.
- Output Variable:
- If the new count is higher than the old count, the script writes
trueto an output variable callednew_data. - Otherwise, it writes
false.
- If the new count is higher than the old count, the script writes
save_last_count(new_count):- Overwrites the file with the new count, ensuring it’s ready for the next run.
4. Committing Changes to the Repo
When the Python script signals that new data was found, GitHub Actions commits last_count.txt back to the repository.
Example Workflow Snippet:
- name: Commit updated last_count.txt
if: steps.scraper.outputs.new_data == 'true'
run: |
git config user.name "github-actions"
git config user.email "actions@github.com"
git add .github/last_count.txt
git commit -m "Update last_count.txt for new records"
git push
Additional Details
Git Commands:
git config user.nameandgit config user.email:- Set a placeholder name and email for the commit, specifically for use by GitHub Actions.
git add,git commit -m, andgit push:- Ensure the updated
last_count.txtfile is staged, committed with a message, and pushed back to the default branch.
- Ensure the updated
5. Benefits of This Approach
- Automation:
- No manual updates are needed. The entire process runs on a schedule and handles updates seamlessly.
- Persisted State:
- The
last_count.txtfile serves as a persisted state. Even if the runner restarts or you switch branches, the tracked count is preserved in version control.
- The
- Collaboration:
- Team members can see the history of changes to
last_count.txt, providing visibility into how often new data is found.
- Team members can see the history of changes to
6. Customizations
- Frequency:
- Adjust the cron schedule or add manual triggers to control how often the workflow runs.
- Advanced Logic:
- Enhance the scraper to track more than just a count (e.g., storing IDs of processed records or gathering detailed metadata).
- Notification Steps:
- Add steps to send notifications via Slack, email, or other integrations when new data is detected.
Conclusion
Using GitHub Actions to maintain a file like last_count.txt is a simple yet effective strategy to keep your pipelines aware of what’s already been processed. By:
- Automating state tracking,
- Preserving the pipeline’s state in version control, and
- Facilitating collaboration and visibility,
you can avoid redundant data processing and ensure downstream steps only trigger when new information is available.
If you’re building a scraping or data ingestion pipeline, consider implementing this pattern—it’s automated, transparent, and easy to adapt to any workflow.
Happy automating!
← Previous Post $~~~~~~~~~~~$ Next Post →