
Hopkins Jesse


I Let AI Write My PR Descriptions for 30 Days — The Results Shocked Me

I hate writing pull request descriptions.

It is the most boring part of my day. I spend hours coding, solving complex logic puzzles, and debugging race conditions. Then I stare at a blank text box and try to summarize what I did in a way that makes sense to my team.

Usually, I write something vague like "Fixed bug" or "Updated API." My tech lead hates this. He sends it back with comments asking for more context. It creates a friction loop that wastes everyone's time.

In January 2026, I decided to stop doing it manually.

I set up an automated agent using the new GitHub Copilot Workspace APIs. The goal was simple: let the AI read the diff, analyze the commit history, and draft the PR description for me. I would only review and edit if necessary.

I ran this experiment for exactly 30 days. I tracked every metric I could think of. The results were not what I expected.

The Setup

I did not just turn on a generic chatbot. That approach fails because generic models lack context. They do not know our internal coding standards or the specific architecture of our monorepo.

I built a small Python script that listens for GitHub webhook events. When a PR is opened, it triggers a local LLM instance running Llama 3.1 405B via Ollama. We host the model locally to avoid sending proprietary code to external APIs; security compliance is strict at my company.
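Stripped of error handling, the core of that bridge looks roughly like this. It is a simplified sketch: the function names are mine, the real script does more validation, and the model tag will depend on how you pulled it into Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def should_trigger(event: dict) -> bool:
    # Only draft a description when a PR is first opened, not on every push.
    return event.get("action") == "opened" and "pull_request" in event

def build_payload(prompt: str, model: str = "llama3") -> bytes:
    # Request body for Ollama's /api/generate endpoint; stream=False
    # returns the whole completion as a single JSON object.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def draft_description(prompt: str) -> str:
    # Blocking call to the local model, so proprietary code never leaves the box.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

The `should_trigger` check matters more than it looks: without it, every force-push regenerates the draft and spams the PR thread.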

The prompt engineering took three days to get right. I had to teach the model to ignore whitespace changes and focus on logic shifts. I also forced it to follow our specific markdown template.
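One trick that helped alongside the prompt work: filter whitespace noise before the diff ever reaches the model. Asking git to suppress it at the source is cheaper than burning prompt tokens teaching the model to ignore it. A sketch, assuming a `base...head` ref range that you would adapt to your branch naming:

```python
import subprocess

def diff_command(base: str, head: str) -> list[str]:
    # -w ignores all whitespace changes; --ignore-blank-lines drops
    # blank-line churn, so only logic changes survive into the prompt.
    return ["git", "diff", "-w", "--ignore-blank-lines", f"{base}...{head}"]

def logic_only_diff(base: str, head: str) -> str:
    # Runs git in the current repo and returns the filtered diff text.
    return subprocess.run(
        diff_command(base, head), capture_output=True, text=True, check=True
    ).stdout
```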

Here is the core logic of the prompt structure I used:

PROMPT_TEMPLATE = """
You are a senior staff engineer. Review the following git diff and commit messages.
Generate a Pull Request description following this exact format:

## Summary
[One sentence explaining the 'why', not the 'what']

## Changes
- [Bullet point 1]
- [Bullet point 2]

## Testing Done
[List specific unit tests added or modified]

## Risks
[Mention any potential side effects or breaking changes]

DIFF:
{git_diff}

COMMITS:
{commit_messages}
"""

The script posts the draft as a comment on the PR. I deliberately configured it not to fill the description field automatically. That gave me a safety net: I could copy-paste the AI's work or reject it entirely.
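Posting to the comment thread instead of the description field is a single API call. GitHub's webhook payload already includes `pull_request.comments_url`, which points at the issue-comments endpoint for that PR. A minimal sketch, with token handling left out:

```python
import json
import urllib.request

def comment_payload(draft: str) -> dict:
    # Prefix the draft so reviewers can tell at a glance it is machine-generated.
    return {"body": "🤖 Suggested PR description:\n\n" + draft}

def post_draft(comments_url: str, draft: str, token: str) -> None:
    # comments_url comes straight from the webhook payload; POSTing a
    # comment leaves the PR description field untouched.
    req = urllib.request.Request(
        comments_url,
        data=json.dumps(comment_payload(draft)).encode(),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req).close()
```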

The Data

I tracked three main metrics over the 30-day period. I compared them against my average performance from the previous three months.

| Metric | Manual Average (Oct–Dec 2025) | AI-Assisted (Jan 2026) | Change |
| --- | --- | --- | --- |
| Time spent writing PRs | 12 minutes per PR | 2.5 minutes per PR | −79% |
| "Request Changes" due to poor docs | 18% of PRs | 4% of PRs | −77% |
| Developer satisfaction (self-rated) | 3/10 | 8/10 | +166% |

The time savings were obvious. But the drop in "Request Changes" was the surprise.

My tech lead noted that the AI was actually better at listing testing steps than I was. I often forget to mention which integration tests I ran. The AI scanned the test files and listed them automatically. This reduced the back-and-forth during code review significantly.

Where It Failed

It was not all smooth sailing. The AI struggled with context outside the immediate diff.

On day 12, I refactored a database migration script. The diff looked small, but the implications were huge. The AI wrote a summary saying "Minor syntax update to migration file."

If I had posted that, it would have been a disaster. Another engineer might have merged it without checking the data integrity constraints. I caught it during my review, but it scared me.

This highlighted a critical flaw. The AI cannot see the broader system impact unless you explicitly feed it that context. It is blind to architectural decisions made six months ago.

I had to update the script to include the last five related issues from Jira in the prompt context. This added about 4 seconds to the generation time but improved accuracy drastically.
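The fix itself was mechanical: fetch the related issues from Jira's search API and append their keys and summaries to the prompt. Roughly like this, though the JQL query and field names depend entirely on your Jira setup:

```python
def jira_context_block(issues: list[dict], limit: int = 5) -> str:
    # issues: results from Jira's /rest/api/2/search endpoint.
    # Capping at five keeps the prompt small; for me it added about
    # 4 seconds of generation time while drastically improving accuracy.
    lines = [f"- {i['key']}: {i['fields']['summary']}" for i in issues[:limit]]
    return "RELATED ISSUES:\n" + "\n".join(lines)
```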

Another failure mode was hallucination. On day 19, the AI claimed it added error handling for a null pointer exception. I checked the code. There was no such change. It had inferred that I should have added it, so it wrote that I did.

This is dangerous. It assumes intent rather than reporting facts. I now have a rule: never trust the "Testing Done" section without verifying the files exist.
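Part of that verification can be automated. A crude check that catches the worst hallucinations: extract anything that looks like a test file path from the "Testing Done" section and confirm it actually exists on disk. The regex below is tuned for a Python repo; adjust it for your language.

```python
import os
import re

def unverified_test_files(testing_done: str, repo_root: str = ".") -> list[str]:
    # Find path-like strings containing "test" and ending in .py, then
    # flag any that do not exist on disk. A hit here means the AI may
    # have claimed a test that was never written.
    candidates = re.findall(r"[\w./-]*test[\w./-]*\.py", testing_done)
    return [p for p in candidates if not os.path.exists(os.path.join(repo_root, p))]
```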

The Human Element

The biggest shock was how it changed my relationship with code reviews.

Before this experiment, I viewed PR descriptions as a tax. I paid it reluctantly. Now, I view them as a verification step.

Because the AI handles the boilerplate, I spend my mental energy on the narrative. I ask myself: "Does this summary actually explain the business value?"

I found myself editing the AI's drafts less for grammar and more for clarity. The AI writes passive voice sometimes. It says "Changes were made to the user module." I change it to "I updated the user module to support SSO."

The tone became more direct. My teammates noticed. One developer asked me if I had taken a technical writing course. I told him I

💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.
