7 Claude Experiments That Accidentally Became Real Tools
I did not set out to build a suite of internal tools. I set out to answer one question: how far can I push a language model with a single well-crafted prompt? The answer, it turns out, is embarrassingly far. What started as weekend experiments, throwaway scripts, API playgrounds, and half-baked ideas I fully expected to delete, quietly became the most useful things in my daily workflow.
The pattern repeated itself seven times. I built something to satisfy curiosity. Two days later, I was using it at work. A week after that, a colleague was asking me to share it. That is the thing nobody tells you about experimenting with Claude: the gap between “prototype” and “real tool” is narrower than you think, and the automation wins are sitting right on the surface.
Here are the seven experiments, ordered roughly by the complexity of how they evolved, not how they started.
1. A meeting notes cleaner that actually understood context
It started with raw Zoom transcripts, walls of “um,” “uh,” “can you hear me,” and three people talking at once. My first instinct was regex. My second instinct, about forty minutes into that, was to ask Claude instead. The prompt was embarrassingly simple: clean this transcript, extract the action items, and assign ownership based on who said what. It worked on the first try.
What turned it into a real tool was the wrapper: a Python script that reads a raw .txt transcript, passes it to the API with a structured prompt, and spits out a clean markdown summary with a decision table. Twenty lines of code. Runs in seconds.
```python
import anthropic

client = anthropic.Anthropic()

def summarize_transcript(raw_text: str) -> str:
    """Send a raw transcript to Claude; get back a structured markdown summary."""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Clean this meeting transcript. Return:
1. A 3-sentence summary
2. Action items as a markdown table with Owner and Due Date columns
3. Key decisions made

Transcript:
{raw_text}""",
        }],
    )
    return message.content[0].text
```
The lesson here is not that the code is clever. It is not. The lesson is that “clean this transcript” is a solved problem the moment you stop trying to solve it with code and start treating it as a language task.
2. A code review assistant that explains the why, not just the what
My second experiment came from frustration. I was reviewing a junior developer’s pull request and found myself writing the same comment I had written a dozen times before: “avoid mutating state inside a loop.” Instead of writing it again, I asked Claude to review the snippet and explain the issue in plain English with an example fix.
The output was better than what I would have typed. So I automated it. The tool now reads a git diff, passes each changed file through Claude with a persona prompt (“you are a senior engineer doing a constructive code review”), and outputs comments grouped by file. It does not replace human judgment. It handles the obvious stuff so the human review can focus on architecture and intent.
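If you want to build something like it, the shape is simple. Here is an illustrative sketch; the diff-splitting and the persona wording below are a simplification of the real thing, not the tool verbatim:

```python
import re
import subprocess
import anthropic

client = anthropic.Anthropic()

REVIEW_PERSONA = (
    "You are a senior engineer doing a constructive code review. "
    "Point out concrete issues, explain why each matters, and show "
    "a short example fix. Skip style nits a linter would catch."
)

def review_changes() -> dict[str, str]:
    """Run Claude over the working-tree diff, one review per changed file."""
    diff = subprocess.run(
        ["git", "diff", "HEAD"], capture_output=True, text=True, check=True
    ).stdout
    comments = {}
    # A unified diff starts each file's section with a "diff --git" header.
    for chunk in re.split(r"(?=^diff --git )", diff, flags=re.MULTILINE):
        if not chunk.strip():
            continue
        filename = chunk.splitlines()[0].split(" b/")[-1]
        message = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=REVIEW_PERSONA,
            messages=[{"role": "user", "content": f"Review this diff:\n\n{chunk}"}],
        )
        comments[filename] = message.content[0].text
    return comments
```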
3. A documentation writer that reads the actual code
Documentation is the task every developer agrees is important and none of us does properly. I built this one out of guilt after shipping a module with zero docstrings and then being unable to remember how it worked three weeks later.
The experiment: pass a raw Python file to Claude and ask it to generate NumPy-style docstrings for every function. The output was not perfect, but it was 80% of the way there and took two seconds. I now run it as a pre-commit hook. The tool writes the first draft; I edit the exceptions. Docstring coverage on my personal projects went from roughly 20% to over 90% in a month.
```python
def generate_docstrings(source_code: str) -> str:
    """Ask Claude to insert NumPy-style docstrings without changing logic."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Add NumPy-style docstrings to every function
in this Python file. Return the complete file with docstrings inserted.
Do not change any logic.

{source_code}""",
        }],
    )
    return response.content[0].text
```
4. A changelog generator from git history
Here is a task I genuinely hated: writing changelogs. You run `git log`, stare at fifty commit messages like “fix stuff” and “wip,” and try to reconstruct a coherent release summary. The experiment was to pass the raw commit log to Claude and ask it to group changes into categories (Features, Bug Fixes, Breaking Changes) and write them in the format a human would actually want to read.
The catch, which I discovered immediately, is that good changelogs require good commit messages. Claude cannot invent context that is not there. What it can do is make weak commit messages readable and sort them intelligently. Good enough to cut the time from twenty minutes to two.
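The whole experiment is one function wrapped around `git log`. A simplified sketch, reusing the `client` from the first tool; the ref arguments are illustrative:

```python
import subprocess

def draft_changelog(from_ref: str, to_ref: str = "HEAD") -> str:
    """Turn the raw commit log between two refs into a grouped changelog."""
    log = subprocess.run(
        ["git", "log", "--oneline", f"{from_ref}..{to_ref}"],
        capture_output=True, text=True, check=True,
    ).stdout
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Group these commits into a changelog with sections
for Features, Bug Fixes, and Breaking Changes. Rewrite each entry as a
clear, user-facing sentence. Drop pure-noise commits (wip, typo fixes).

{log}""",
        }],
    )
    return message.content[0].text
```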
5. A prompt library that critiques itself
This one is the most meta experiment on the list. I was spending a lot of time iterating on prompts for other tools, so I built a tool that evaluates prompts. You give it a prompt and a set of sample inputs and expected outputs, and it scores the prompt on clarity, specificity, and edge-case coverage, then suggests revisions.
The recursive twist: the evaluator prompt was itself iterated on using the evaluator. At some point, it felt like handing a teacher an exam to grade that included questions about their own teaching quality. It worked. The meta-loop shaved hours off prompt engineering for every downstream tool.
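Stripped to its core, the evaluator is one structured prompt. The rubric wording below is a rough sketch of the idea, again reusing `client`:

```python
def evaluate_prompt(prompt: str, samples: list[tuple[str, str]]) -> str:
    """Score a prompt against sample (input, expected output) pairs."""
    cases = "\n\n".join(
        f"Input:\n{inp}\n\nExpected output:\n{out}" for inp, out in samples
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Evaluate the prompt below. Score it 1-10 on clarity,
specificity, and edge-case coverage, citing the sample case that justifies
each score. Then suggest a revised prompt.

Prompt under review:
{prompt}

Sample cases:
{cases}""",
        }],
    )
    return message.content[0].text
```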
6. A local file search tool that understands intent
Standard search is lexical. If you search for “budget,” you get every file with the word “budget.” What I actually wanted was a semantic search to find files related to Q3 planning, even if they never use those exact words. The experiment combined two things: embedding my local documents with `sentence-transformers` and using Claude to interpret the search query before matching.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_search(query: str, documents: list[str]) -> list[int]:
    """Return indices of the five documents most similar to the query."""
    # Normalize embeddings so the dot product below is cosine similarity.
    query_emb = model.encode([query], normalize_embeddings=True)
    doc_embs = model.encode(documents, normalize_embeddings=True)
    scores = np.dot(doc_embs, query_emb.T).flatten()
    return np.argsort(scores)[::-1][:5].tolist()
```
The Claude layer sits on top: given the top five results, it ranks them by relevance to the original intent and explains why each made the cut. The semantic retrieval finds candidates; the language model interprets them. Neither works as well alone.
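That layer is one more short function. Roughly this, reusing `semantic_search` and `client` from above:

```python
def search_with_explanations(query: str, documents: list[str]) -> str:
    """Retrieve candidates semantically, then let Claude rank and explain them."""
    candidates = "\n\n".join(
        f"[{i}] {documents[i][:500]}"  # truncate long documents for the prompt
        for i in semantic_search(query, documents)
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""A user searched for: "{query}"

Rank these candidate documents by relevance to that intent, and explain
in one sentence why each made the cut.

{candidates}""",
        }],
    )
    return message.content[0].text
```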
7. A weekly self-review that holds me accountable
The last experiment is the least technical and the most personally useful. Every Friday, a cron job reads my commit history, my task list, and a plaintext log I update throughout the week, and passes everything to Claude with a single prompt: “Write a weekly review that identifies what I accomplished, what I said I would do but did not, and one concrete recommendation for next week.”
It is uncomfortably accurate. The tool does not care about excuses. It reads what I shipped versus what I planned and gives me a plain-language gap analysis. I have shipped more consistently in the months since I started running it than in the year before.
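The script behind the cron job is mostly glue. Here is a simplified sketch, with hypothetical paths standing in for wherever your task list and log live:

```python
import subprocess
from pathlib import Path

def weekly_review() -> str:
    """Gather the week's evidence and ask Claude for a gap analysis."""
    commits = subprocess.run(
        ["git", "log", "--since=7.days", "--oneline"],
        capture_output=True, text=True, check=True,
    ).stdout
    tasks = Path("~/notes/tasks.md").expanduser().read_text()  # hypothetical path
    log = Path("~/notes/week.log").expanduser().read_text()    # hypothetical path
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Write a weekly review that identifies what I
accomplished, what I said I would do but did not, and one concrete
recommendation for next week.

Commits this week:
{commits}

Task list:
{tasks}

Daily log:
{log}""",
        }],
    )
    return message.content[0].text
```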
My Thoughts
None of them started with the question “how can I use the Claude API?” Every single one started with a problem I was genuinely annoyed by. The technology was the solution, not the starting point. That distinction matters more than any library or prompt technique.
The second thing they share: they are all automations of language tasks. Code review comments, documentation, changelogs, and meeting notes look like technical problems, but they are fundamentally writing problems. Language models are not magic, but they are the right tool for the right job in a way that no amount of regex will ever match.
The third thing: every one of them took less than a weekend to reach a useful state. Not a polished state. Not a production-ready state. A useful one. That is enough to start.
Pick the task you hate most in your week. There is a good chance it is a language task. There is a better-than-good chance you can automate a meaningful part of it.
Drop your questions in the comments.
- My Code Diary



