GitHub Actions to Update last_count.txt

A common challenge in automated data processing pipelines is tracking how much data you’ve processed since the last run. For example, when scraping a website or reading from an API, you may need to store the count of previously processed items. By storing that count in a file (like last_count.txt) and letting GitHub Actions update it, you can keep your pipeline running smoothly without duplicating work.

In this post, we’ll explore how to implement a GitHub Actions workflow that:

  1. Runs a Python script to check for new data.
  2. Updates last_count.txt if new data is found.
  3. Commits and pushes the changes back to your repository automatically.

1. Why Track a “Last Count”?

Let’s say you’re scraping a site for new records every day. If you don’t store the number of records you found last time, you have no easy way of knowing whether today’s run has discovered additional items. Keeping an integer in a simple text file provides a straightforward checkpoint—one your script can read and update as needed.


2. Sample Workflow File

Below is a simplified .github/workflows/update_last_count.yml that demonstrates the main idea:

name: Update Last Count

on:
  schedule:
    - cron: "0 * * * *"  # runs hourly
  workflow_dispatch:

permissions:
  contents: write

jobs:
  scrape_and_update:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"

      - name: Install dependencies
        run: |
          pip install requests beautifulsoup4

      - name: Run scraper
        id: scraper
        run: python .github/scripts/scrape_data.py

      - name: Commit updated last_count.txt
        if: steps.scraper.outputs.new_data == 'true'
        run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add .github/last_count.txt
          git commit -m "Update last_count.txt for new records"
          git push

Key Points

  1. on.schedule:
    • The workflow runs every hour (as dictated by the cron expression).
    • You can change the schedule as needed.
  2. id: scraper:
    • Assigns an ID to the “Run scraper” step.
    • This allows you to reference outputs from this step (e.g., new_data).
  3. if: steps.scraper.outputs.new_data == 'true':
    • A condition that checks if the scraper found new data.
    • If so, it proceeds with committing updates to last_count.txt.

3. The Python Script

The script (.github/scripts/scrape_data.py) typically does the following:

  1. Load the old count from last_count.txt.
  2. Scrape or fetch the current count of records from your data source.
  3. Compare the old count to the new count.
  4. If new records are found, update last_count.txt and output new_data=true so the workflow knows to commit changes.

Simplified Example:

#!/usr/bin/env python3
import os
import sys

LAST_COUNT_FILE = ".github/last_count.txt"

def load_last_count():
    """Load the last recorded count from the file."""
    if not os.path.exists(LAST_COUNT_FILE):
        return 0
    with open(LAST_COUNT_FILE, "r") as f:
        return int(f.read().strip())

def save_last_count(count):
    """Save the new count to the file."""
    with open(LAST_COUNT_FILE, "w") as f:
        f.write(str(count))

def main():
    """Main logic to check for new data and update the count."""
    old_count = load_last_count()
    # Simulate fetching a "new count" from somewhere (e.g., scraping or an API call).
    new_count = 100  # Just an example

    if new_count > old_count:
        # Write GitHub outputs for new_data and details (if needed)
        print("::set-output name=new_data::true")
        # Save the new count to last_count.txt
        save_last_count(new_count)
    else:
        print("::set-output name=new_data::false")

if __name__ == "__main__":
    main()

How It Works

  1. load_last_count():
    • Returns the integer stored in last_count.txt, defaulting to zero if the file doesn’t exist.
  2. new_count:
    • A placeholder for a real-world process (e.g., scraping a site or calling an API) to determine the current count of records.
  3. Output Variable:
    • If the new count is higher than the old count, the script writes true to an output variable called new_data.
    • Otherwise, it writes false.
  4. save_last_count(new_count):
    • Overwrites the file with the new count, ensuring it’s ready for the next run.

4. Committing Changes to the Repo

When the Python script signals that new data was found, GitHub Actions commits last_count.txt back to the repository.

Example Workflow Snippet:

- name: Commit updated last_count.txt
  if: steps.scraper.outputs.new_data == 'true'
  run: |
    git config user.name "github-actions"
    git config user.email "actions@github.com"
    git add .github/last_count.txt
    git commit -m "Update last_count.txt for new records"
    git push

Additional Details

Git Commands:

  1. git config user.name and git config user.email:
    • Set a placeholder name and email for the commit, specifically for use by GitHub Actions.
  2. git add, git commit -m, and git push:
    • Ensure the updated last_count.txt file is staged, committed with a message, and pushed back to the default branch.

5. Benefits of This Approach

  1. Automation:
    • No manual updates are needed. The entire process runs on a schedule and handles updates seamlessly.
  2. Persisted State:
    • The last_count.txt file serves as a persisted state. Even if the runner restarts or you switch branches, the tracked count is preserved in version control.
  3. Collaboration:
    • Team members can see the history of changes to last_count.txt, providing visibility into how often new data is found.

6. Customizations

  1. Frequency:
    • Adjust the cron schedule or add manual triggers to control how often the workflow runs.
  2. Advanced Logic:
    • Enhance the scraper to track more than just a count (e.g., storing IDs of processed records or gathering detailed metadata).
  3. Notification Steps:
    • Add steps to send notifications via Slack, email, or other integrations when new data is detected.

Conclusion

Using GitHub Actions to maintain a file like last_count.txt is a simple yet effective strategy to keep your pipelines aware of what’s already been processed. By:

you can avoid redundant data processing and ensure downstream steps only trigger when new information is available.

If you’re building a scraping or data ingestion pipeline, consider implementing this pattern—it’s automated, transparent, and easy to adapt to any workflow.

Happy automating!


← Previous Post $~~~~~~~~~~~$ Next Post →