I Used Claude Code to Build a Local Archive of Every Paper I’ve Written
I’d put this task off for decades. Claude Code helped me finish it in an afternoon.
For most of my career I’ve wanted one organized place that holds the original PDF of every paper I’ve published. Not a list of links, since links rot, but the actual files, in one place I control. I never built it. It was always too tedious to justify the time, and too specialized for any tool I could buy. This week I built it with Claude Code in about half a day. The result lives at github.com/feamster/publications, and on my site.
The Idea
The design decision that made this tractable was to treat the CV as the source of truth. My CV already lists every publication, in order, grouped into journal papers, conference papers, workshop papers, theses, and books. So instead of maintaining a separate list, the tool parses the CV’s LaTeX source directly. Every paper cited there becomes an entry, joined to my BibTeX metadata, and the tool then finds the PDF for each one.
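To make the idea concrete, here is a minimal sketch of the parsing step, not the tool’s actual code. It assumes the CV cites papers with standard \cite-style commands and that the BibTeX file has already been loaded into a plain dict keyed by citation key. The function names and the shape of the dict are hypothetical.

```python
import re

# Matches \cite{...}, \citep{...}, \nocite{...} and similar; optional [..] arguments are skipped.
# Assumption: the CV uses standard cite-style commands rather than a custom macro.
CITE_RE = re.compile(r"\\(?:no)?cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def cited_keys(latex_source: str) -> list[str]:
    """Return citation keys in order of first appearance, without duplicates."""
    keys, seen = [], set()
    for match in CITE_RE.finditer(latex_source):
        for key in match.group(1).split(","):
            key = key.strip()
            if key and key not in seen:
                seen.add(key)
                keys.append(key)
    return keys


def build_entries(latex_source: str, bib: dict) -> tuple[list[dict], list[str]]:
    """Join cited keys to bibliography metadata.

    `bib` is a hypothetical {key: {"title": ..., "year": ...}} mapping.
    Keys cited in the CV but absent from the bibliography are reported, not guessed.
    """
    entries, missing = [], []
    for key in cited_keys(latex_source):
        if key in bib:
            entries.append({"key": key, **bib[key]})
        else:
            missing.append(key)
    return entries, missing
```

Because the list of entries is derived from the citations rather than from the bibliography, a paper that sits in the BibTeX file but is never cited in the CV simply never appears.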
The output is a Git repository, organized by year, with every file named by its citation key and a generated README that mirrors the CV’s numbering and grouping exactly. Paper [137] is right there, linked, in the same order as my CV.
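The layout can be sketched in a few lines. The exact directory scheme (one folder per year, one `<key>.pdf` per paper) and the README line format below are my assumptions for illustration; the entry fields are hypothetical.

```python
from pathlib import PurePosixPath


def archive_path(entry: dict) -> PurePosixPath:
    """Assumed layout: <year>/<citation key>.pdf, relative to the repository root."""
    return PurePosixPath(str(entry["year"])) / f"{entry['key']}.pdf"


def readme_line(number: int, entry: dict) -> str:
    """One README line per paper, reusing the CV's own number for that entry."""
    return f"[{number}] {entry['title']} ({archive_path(entry)})"
```

Keeping the CV number in the README line is what lets a reader go from a citation in the CV straight to the file.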
How It Fetches Things
Collecting about 250 PDFs spanning 1999 to 2026 meant building a waterfall of sources and trying them in order, cleanest first:
Open APIs first. OpenAlex, Unpaywall, Crossref, and Semantic Scholar resolve titles to DOIs and DOIs to open-access PDFs. They’re free, well-documented, and straightforward to script against.
Open-access venues. arXiv, USENIX, NDSS, the AAAI portal, and MIT’s DSpace (for my theses) serve PDFs at stable, predictable URLs. USENIX is the model here. Every paper is free, forever, at a clean URL.
Licensed content through the library. For ACM, Springer, and the like, the tool authenticates through my university’s library proxy, with single sign-on and two-factor, in a real browser session, and pulls the PDFs I’m entitled to.
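The waterfall itself is a simple pattern: an ordered list of resolvers, each of which either returns a PDF URL or gives up. The sketch below shows the control flow only, under the assumption that each source is wrapped in a function taking an entry dict. The resolver names and the `arxiv_id` field are hypothetical, and the API and proxy resolvers are left out because they depend on network access and credentials.

```python
from typing import Callable, Optional

# A resolver takes an entry dict and returns a PDF URL, or None if it has nothing.
Resolver = Callable[[dict], Optional[str]]


def arxiv_resolver(entry: dict) -> Optional[str]:
    """Build an arXiv PDF URL when the entry carries an arXiv identifier."""
    arxiv_id = entry.get("arxiv_id")  # hypothetical field name
    return f"https://arxiv.org/pdf/{arxiv_id}" if arxiv_id else None


def first_pdf_url(entry: dict, resolvers: list[tuple[str, Resolver]]) -> Optional[tuple[str, str]]:
    """Try each resolver in order; return (source_name, url) for the first hit, else None."""
    for name, resolve in resolvers:
        try:
            url = resolve(entry)
        except Exception:
            # One failing source should not stop the waterfall; fall through to the next.
            continue
        if url:
            return name, url
    return None
```

Ordering the list from open sources to licensed ones means the authenticated browser session is only needed for papers that nothing else can supply.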
Where It Got Hard
The instructive part, and the part that says something about academic publishing, is where automation failed.
IEEE Xplore sits behind an aggressive bot-detection layer. It blocked automated downloads even when I went through my university’s standard library proxy (EZproxy), a DNS-based redirect that rewrites the publisher’s hostname. The fix was to route traffic through a campus network connection so the requests came from an institutional IP directly. Even then, IEEE rate-limited me after about fifteen downloads and returned errors until I slowed down.
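Slowing down is easy to automate. The following is a generic exponential-backoff sketch, not what the tool actually does: the `RateLimited` exception, the 30-second starting delay, and the attempt limit are all assumptions, and the `fetch` function is whatever performs the download and raises `RateLimited` when the server signals throttling.

```python
import time


class RateLimited(Exception):
    """Raised by a fetch function when the server signals throttling."""


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each rate-limit error.

    `sleep` is injectable so the logic can be exercised without real waiting.
    Re-raises RateLimited once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2
```

A fixed pause between every download would also work; backoff just avoids paying that cost on sources that never throttle.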
Everything else that failed did so in a similar way. A law journal’s repository threw bot challenges. ResearchGate wouldn’t hand a file to a script, since it gates downloads behind a login. One repository wouldn’t even complete a TLS handshake. A few of my oldest papers were only ever hosted on university servers that are now gone. This last set came down to a handful of papers, so I tracked them down myself, usually through Google Scholar or ResearchGate, and dropped the PDFs in a folder. The tool then matched each one back to the right CV entry by title and DOI and filed it. At that scale, doing it by hand was no trouble.
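Matching a hand-dropped PDF back to its entry can be sketched as a two-step lookup: DOI first, then a normalized title. This assumes the PDF’s DOI and title have already been extracted into a dict by some other step; the field names and both function names are hypothetical.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_entry(pdf_meta: dict, entries: list[dict]):
    """Return the entry matching the PDF by DOI, else by normalized title, else None."""
    doi = (pdf_meta.get("doi") or "").lower()
    if doi:
        for entry in entries:
            if (entry.get("doi") or "").lower() == doi:
                return entry
    title = normalize_title(pdf_meta.get("title") or "")
    if title:
        for entry in entries:
            if normalize_title(entry.get("title") or "") == title:
                return entry
    return None
```

DOIs are compared case-insensitively because they are defined to be case-insensitive; the title fallback covers older papers that never had one.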
The pattern is consistent. The open infrastructure made hundreds of papers effortless. The friction, the rate limits, the dead links, and the bot walls were all on the proprietary side. If you want a concrete argument for open access and open metadata, try assembling your own publication record and see which sources help and which fight you.
A Bonus: It Found a 20-Year-Old Error in My CV
Because the tool checks that each downloaded PDF’s title matches its CV entry, it flagged a mismatch on a paper from more than twenty years ago. My CV listed it under its tech-report title, from around 2005, but cited it as the journal version, which was published in 2007 under a different title. The entry had been wrong for the better part of two decades, copied forward from one CV to the next. I corrected the BibTeX, rebuilt the CV and website from it, and archived the correct published PDF. Forcing a machine to reconcile your records against reality is a good way to find what you’ve been overlooking.
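A check like this does not need to be sophisticated. Here is a minimal sketch using the standard library’s `difflib`, assuming the CV title and the title read from the PDF are both available as strings. The 0.9 threshold and the function names are assumptions, not the tool’s actual settings.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing and collapsing whitespace."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def flag_mismatches(pairs, threshold=0.9):
    """pairs: iterable of (key, cv_title, pdf_title).

    Returns the keys whose two titles fall below the similarity threshold,
    so a human can decide which record is wrong.
    """
    return [key for key, cv_title, pdf_title in pairs
            if title_similarity(cv_title, pdf_title) < threshold]
```

A fuzzy threshold rather than exact equality tolerates small differences in capitalization or punctuation while still catching a paper filed under an entirely different title.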
The Workflow, and Making It Last
The part I care about most is that this isn’t a one-off script. It’s a repeatable workflow, captured as reusable Claude skills, so I can run it the same way every month.
The ordering follows from treating the CV as canonical:
Update the source. Add each new paper to my bibliography and to the CV. One rule the workflow encodes: a paper only enters the archive once it’s actually cited in the CV. The citation list defines membership, and the bibliography just supplies metadata.
Rebuild what’s derived from it. That’s the CV PDF and my website, both generated from the same bibliography.
Re-sync the archive. The skill rebuilds its index, fetches any newly available PDFs (open access first, then the library proxy, then the campus connection for the stubborn ones), regenerates the README, and commits.
The operational knowledge is written into the skill itself: which sources are friendly, how to authenticate, that IEEE will rate-limit you, and that the long-tail stragglers are easiest to drop into a folder and let the tool file. Keeping my paper archive current went from a vague, years-long aspiration to a ten-minute monthly task that mostly runs itself.
The Point
What’s notable about this project isn’t just the archive I’d always wanted. It’s the category of task, and how well suited agentic AI turns out to be for it. I’d wanted to do this for years and never started, because it was too tedious to justify a weekend and too specialized for any off-the-shelf tool. That describes a lot of valuable work: personal, fiddly infrastructure that never clears the bar. Agentic AI is unreasonably good at it. Half a day, start to finish, including the awkward stragglers and a latent error I didn’t know I had.
I suspect a lot of us have a project like this on the list. Mine’s done.

