A postdoc in your lab mentions at Wednesday meeting that she asked ChatGPT about a finding she thinks your group reported first. It answered confidently, and it cited a 2025 review in Nature Reviews. It did not cite your paper. She shrugs — that is just how AI works — and you half-agree, because you have more pressing things to think about than how a chatbot attributes credit.
This post is about what actually happens when a trainee asks an AI system about your work, and why the answer depends less on the quality of your paper than on which publisher hosts it. In a reviewer-grade measurement we ran this week on the top 50 biomedical journals by h-index, we found that only five of those journals are in a state where a well-behaved AI crawler could actually read their articles. The other 45 are blocked — by two distinct mechanisms we can document from public data, neither of which most PIs know about.
Of the top 50 biomedical journals by h-index, measured April 2026:

- 5 are fully open to a compliant AI crawler (PLoS ONE, The EMBO Journal, Journal of Clinical Investigation, NeuroImage, Chemical Society Reviews).
- 26 block at the server level, returning HTTP 403 to our research crawler regardless of what their robots.txt declares.
- 19 more allow our polite crawler through the server but name GPTBot, ClaudeBot, PerplexityBot, and Google-Extended as disallowed in robots.txt — including every Nature Portfolio title and every Cell Press title. A compliant AI bot would fetch none of those 19 either.

Only 5 of 50 — 10% — are reachable for AI retrieval by a well-behaved crawler today.
What actually happens to a paper in a blocked journal
The chain is worth walking through slowly, because the step that matters for your citation count is in the middle and easy to miss.
A postdoc asks ChatGPT: "What's the latest on transcription factor X?" ChatGPT's retrieval layer searches the web, finds candidate sources — including a paper your group published last year — and tries to fetch each candidate's canonical page. For each reachable page, ChatGPT extracts the title, authors, abstract, and key claims, then decides which sources to cite in the answer and in what order.
For a paper on a blocked publisher platform, that fetch fails. Either the server refuses outright with HTTP 403 (one of our 26 server-blocked journals) or the journal's robots.txt tells the compliant bot not to fetch (one of our 19 robots-blocked journals). ChatGPT cannot extract the abstract or the citation metadata. So it picks the next-most-retrievable source that covers the same result: a 2025 Nature Reviews article that cites your paper, because Nature's article platform returns 200 and contains readable content. The review gets the citation. Your paper is mentioned inside the review's reference list — which ChatGPT did not read, because its retrieval layer stopped at the page it could parse.
Your name is still in the chain. The authority signal is not. A reader asking the system about your work is handed the review, not your paper.
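The two gates in that chain — the robots.txt declaration and the server's actual response — can be sketched as a single predicate. This is an illustrative sketch of the compliant-crawler decision, not any AI system's actual retrieval code; the function name and inputs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def compliant_can_read(robots_txt: str, url: str, bot_ua: str, server_status: int) -> bool:
    """A compliant AI crawler reads a page only if BOTH gates pass:
    1. robots.txt does not disallow its user agent for that URL, and
    2. the server actually serves the page (HTTP 200)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(bot_ua, url):
        return False              # mechanism two: declared disallow
    return server_status == 200   # mechanism one: server-level block

# A robots.txt that names GPTBot as disallowed (mechanism two):
robots = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /"
print(compliant_can_read(robots, "https://example.org/article/1", "GPTBot", 200))   # False
print(compliant_can_read(robots, "https://example.org/article/1", "OtherBot", 403)) # False
print(compliant_can_read(robots, "https://example.org/article/1", "OtherBot", 200)) # True
```

Either gate failing is enough: a paper behind a 403 and a paper behind a robots disallow are equally invisible to a bot that plays by the rules.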
The two blocking mechanisms, from the data
Mechanism one — server-level blocks (26/50 journals)
Twenty-six journals in our set returned HTTP 403 or the equivalent to our research crawler (UA string: AcademicSEO-Research/1.0 (+mailto:[email protected])). We attempted up to eight different recent DOIs per journal, so this is not a one-article fluke — every attempt returned 403.
To rule out the trivial explanation that publishers just block our specific UA string, we ran a diagnostic on nine representative blocked hosts with three additional user agents — a generic Firefox 128 string, a generic Chrome 124/macOS string, and our research UA. Every combination of host × UA returned HTTP 403. The block is therefore not at the UA-string layer. It is implemented at a deeper level (TLS fingerprint, JA3/JA4 hash, or header heuristics), which means it cannot be defeated by changing the user-agent string. Any AI crawler today that is not on the publisher's whitelist faces the same 403.
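The inference in that diagnostic can be stated as code: if any host served one UA and refused another, the block would be at the UA layer; uniform 403s across all UAs mean it is not. A sketch with a hypothetical function name — the probe data shape is ours, the logic is the argument above.

```python
def block_is_ua_independent(probe: dict[tuple[str, str], int]) -> bool:
    """Given observed {(host, user_agent): http_status} results, conclude the
    block is UA-independent only if no host shows a mixed response pattern
    and every request was refused."""
    hosts = {h for (h, _) in probe}
    for host in hosts:
        statuses = {s for (h, _), s in probe.items() if h == host}
        # A UA-level block would show a mix: some UAs 200, some 403.
        if len(statuses) > 1:
            return False
    return all(s == 403 for s in probe.values())

# Three UAs x two hosts, all 403 — the pattern we observed on blocked platforms:
probe = {(h, ua): 403
         for h in ("host-a.example", "host-b.example")
         for ua in ("Firefox/128", "Chrome/124", "AcademicSEO-Research/1.0")}
print(block_is_ua_independent(probe))  # True
```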
The server-blocked set, by publisher:
| Publisher | Blocked journals in our top-50 set | n |
|---|---|---|
| Lippincott Williams & Wilkins | Circulation, Journal of Clinical Oncology, Neurology | 3 |
| Oxford University Press | Nucleic Acids Research, Bioinformatics | 2 |
| Rockefeller University Press | The Journal of Cell Biology, The Journal of Experimental Medicine | 2 |
| Nature Portfolio | Nature Reviews Cancer, Nature Reviews Neuroscience | 2 |
| American Medical Association | JAMA, Archives of General Psychiatry | 2 |
| Other (one journal each) | NEJM (Massachusetts Medical Society), Science (AAAS), PNAS (NAS), Pediatrics (AAP), The Journal of Immunology (AAI), ACS Nano (ACS), Angewandte Chemie (Wiley), Annual Review of Biochemistry (Annual Reviews), American Journal of Psychiatry (APA), Psychological Review (APA), Physiological Reviews (APS), Annals of Internal Medicine (ACP), Journal of Neuroscience (SfN), Journal of Geophysical Research: Atmospheres (AGU), Blood (ASH, listed in OpenAlex as Elsevier BV) | 15 |
These journals cluster on a small number of publisher-platform stacks, but we are not going to name the specific middleware vendors because we cannot verify platform attribution at the HTTP level — our probe showed most of the hosts sit behind Cloudflare regardless of underlying platform, and the Server header does not distinguish them. Attributing "Silverchair" or "Atypon" by publisher name would be inference, not observation, and this post is being held to a reviewer standard.
Mechanism two — robots.txt disallow (19/50 additional journals)
Nineteen further journals passed our server-level fetch but declare at least one of GPTBot, ClaudeBot, PerplexityBot, or Google-Extended as disallowed in their robots.txt. Our polite crawler reached them because our UA string is not any of those named bots — but a compliant AI bot that self-identifies honestly and obeys robots.txt, which is what GPTBot, ClaudeBot, and PerplexityBot all do according to OpenAI, Anthropic, and Perplexity's published crawler documentation, would not fetch these pages. The net operational result for an AI system is identical to a 403.
This set includes, among others:
- All reachable Nature Portfolio titles — Nature, Nature Communications, Nature Medicine, Nature Genetics, Nature Biotechnology, Nature Neuroscience, Nature Immunology, and the Nature Reviews imprints we could reach. Every one of these declares GPTBot, ClaudeBot, PerplexityBot, and Google-Extended disallowed.
- All reachable Cell Press titles — Cell, Neuron, Immunity, Molecular Cell — via their ScienceDirect-hosted article pages. Elsevier's sciencedirect.com platform also declares the four main AI bots disallowed.
- Other Elsevier titles — Gastroenterology, The Lancet Oncology, the Journal of the American College of Cardiology, the Journal of Biological Chemistry.
- Genes & Development (Cold Spring Harbor).
We validated these robots.txt parses with two independent parsers: our own hand-rolled implementation, and the Protego library used by Scrapy (a spec-compliant RFC 9309 parser). On 160 bot × journal combinations the two parsers agreed in 100% of cases. The declared-blocked claim is therefore robust to parser error.
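Protego is a third-party library, but the cross-validation idea is parser-agnostic: run two independent implementations over the same robots.txt body and check they agree on every bot × path decision. A sketch with the stdlib parser standing in for one side and a deliberately simplified hand-rolled matcher (prefix-only, no wildcards or Allow lines — an illustration, not our production parser) on the other.

```python
from urllib.robotparser import RobotFileParser

def stdlib_allows(robots_txt: str, ua: str, path: str) -> bool:
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(ua, "https://example.org" + path)

def handrolled_allows(robots_txt: str, ua: str, path: str) -> bool:
    """Minimal matcher: find the group for this UA (falling back to '*')
    and test the path against its Disallow prefixes."""
    groups, current = {}, []
    for line in robots_txt.splitlines():
        line = line.split("#")[0].strip()
        if line.lower().startswith("user-agent:"):
            current = groups.setdefault(line.split(":", 1)[1].strip().lower(), [])
        elif line.lower().startswith("disallow:"):
            rule = line.split(":", 1)[1].strip()
            if rule:
                current.append(rule)
    rules = groups.get(ua.lower(), groups.get("*", []))
    return not any(path.startswith(r) for r in rules)

robots = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /private/"
agree = all(
    stdlib_allows(robots, bot, path) == handrolled_allows(robots, bot, path)
    for bot in ("GPTBot", "ClaudeBot", "PerplexityBot")
    for path in ("/article/1", "/private/data")
)
print(agree)  # True
```

Any disagreement between the two parsers flags a robots.txt body for manual inspection; on our 160 combinations there were none.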
The declared-vs-observed gap, across all 26 server-blocked journals
The most instructive finding — the one I want you to remember — is that 15 of the 26 server-blocked journals explicitly declare the main AI bots as allowed in their robots.txt, and still return 403 at the server. This is not one weird case; it is the majority of the server-blocked set.
Nucleic Acids Research (OUP), Bioinformatics (OUP), PEDIATRICS (AAP), Psychological Review (APA), Annual Review of Biochemistry, and The Journal of Immunology (AAI) all declare all seven AI bots we track (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, ChatGPT-User, CCBot, anthropic-ai) as allowed in robots.txt. Every one of those seven declared-allowed bots nonetheless receives HTTP 403 from the server when an article is actually requested.
PNAS, Annals of Internal Medicine, American Journal of Psychiatry, Neurology, ACS Nano, Angewandte Chemie — declare five to six AI bots allowed; all are blocked at the server anyway.
NEJM declares ClaudeBot and anthropic-ai as the two AI bots it permits. Both are 403'd at the server. NEJM's robots.txt does not mention our research UA at all, yet the server still returns 403 — meaning the block is whitelist-based (only specific approved agents are served), not blacklist-based (named bots are refused).
Parsed from each journal's /robots.txt using Protego, April 2026. Server responses were obtained from up to three valid sample DOIs per journal, drawn from a pool of eight recent candidates each.
Robots.txt is the thing a journal says it does. The server response is the thing it actually does. A PI reading OUP's robots.txt would reasonably conclude that Nucleic Acids Research welcomes GPTBot. The server disagrees, and the server is the thing AI systems actually encounter.
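Crossing declared intent against observed response gives exactly four states, and only one of them is readable by a compliant bot. A hypothetical classification function makes the cells explicit:

```python
def access_state(declared_allowed: bool, server_status: int) -> str:
    """Cross the declared robots.txt intent with the observed server
    response. Only one of the four cells is readable by a compliant AI bot."""
    served = server_status == 200
    if declared_allowed and served:
        return "open"                            # our 5 fully open journals
    if declared_allowed and not served:
        return "declared-open, server-blocked"   # 15 of the 26 (e.g. NAR)
    if not declared_allowed and served:
        return "robots-blocked"                  # the 19 (e.g. Nature, Cell)
    return "blocked at both layers"

print(access_state(True, 403))  # declared-open, server-blocked
```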
The five journals that are fully open
Five of the fifty top biomedical journals are simultaneously reachable by our polite crawler AND declare GPTBot, ClaudeBot, PerplexityBot, and Google-Extended all allowed in robots.txt. These are the journals a compliant AI system could actually read today without hitting either wall. In rank order on our composite discoverability framework:
| # | Journal | Publisher | Composite |
|---|---|---|---|
| 1 | PLoS ONE | Public Library of Science | 80.6 |
| 2 | The EMBO Journal | EMBO Press | 73.5 |
| 3 | Journal of Clinical Investigation | ASCI | 71.0 |
| 4 | NeuroImage | Elsevier (open access) | 68.2 |
| 5 | Chemical Society Reviews | Royal Society of Chemistry | 59.2 |
The pattern is unsurprising in retrospect: three gold-OA or diamond-OA journals (PLoS ONE, EMBO J, JCI), one Elsevier title that happens to be open access and inherits the ScienceDirect platform's AI-bot-allowed robots configuration (NeuroImage), and one selective RSC review journal. Four of the five are open-access. Open access is not the same as AI-reachable — the Nature Reviews imprints are partially open access and still blocked — but the fully-reachable list is dominated by OA titles, not subscription ones.
PLoS ONE is the only journal in the top 50 that clears 80 on our composite. This is not because PLoS ONE is a more prestigious journal than Nature — it is not. It is because PLoS ONE's platform is structurally oriented toward being read by machines: complete Highwire Press citation meta tags on every article, automatic PubMed Central deposition of the full text, all major AI crawlers explicitly allowed in a robots.txt the platform actually honours, and an llms.txt file at the root of the site describing how the content should be retrieved. It is what a journal designed today for AI-era discoverability would look like.
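For readers who have not met llms.txt: it is a proposed convention — a markdown file at the site root telling language-model crawlers where the readable content lives. The sketch below is a hypothetical illustration of the format for a generic journal, not a reproduction of PLoS ONE's actual file; all URLs and section names are invented.

```markdown
# Example Journal

> Open-access journal. Full text of every article is readable without
> authentication; machine-readable metadata is in Highwire citation meta tags.

## Article access

- [Article pages](https://journal.example.org/articles/): canonical HTML, one page per DOI
- [Full-text XML](https://journal.example.org/xml/): structured full text for every article
```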
"Should I publish in PLoS ONE instead, then?"
No. This is the reaction the post wants to pre-empt, because it is tempting and wrong.
Citation impact and AI discoverability are different dimensions. For most careers and most manuscripts, citation impact still matters more: your tenure committee, your next grant panel, and the PIs who read your papers have not rebuilt their attention around ChatGPT. If your lab's story is best served by a Cell paper, it is best served by a Cell paper. We are not suggesting otherwise.
What we are suggesting is that papers in blocked journals have to work harder on their other discoverability surfaces, because the canonical journal page cannot do the retrieval work any more. A Cell paper with a strong preprint, linked ORCIDs, a front-loaded abstract, and a PubMed Central copy is in roughly the same shape for AI retrieval as a PLoS ONE paper is by default. A Cell paper with none of those things loses in AI retrieval in a way that would not have been visible five years ago.
What a PI can do — and it really is an afternoon's work
The short list
- Put a preprint up before you submit. bioRxiv or medRxiv for biomedical work, your institutional repository as a fallback. Preprint servers are crawler-accessible by default and generate the citation meta tags that retrieval systems look for. If your journal's policy allows it, post the preprint before the submission clock starts.
- Check the preprint has citation meta tags. View the page source on your bioRxiv page and search for citation_title. If it is missing, the server team needs to know. This almost never happens on bioRxiv but we have seen it on institutional repositories and on a few preprint mirrors.
- Link every co-author's ORCID at submission, not after. Late linking does not propagate well through the retrieval graph. ORCID is the only persistent author identifier that survives spelling drift in your name, and it is the anchor every retrieval system uses to disambiguate authors.
- Front-load the abstract. Your first two sentences should state the finding with extractable specifics — the organism, the effect size, the mechanism, the population — not background context. Retrieval systems weight the opening of the abstract disproportionately, because that is where they look for a citable claim. Context and background go in sentence three onward.
- Use green OA if it is available. Deposit the accepted manuscript somewhere reachable. If your journal's platform is in the 45 blocked ones, the green OA copy may be the only version of your paper a compliant AI crawler can read. PubMed Central auto-deposit is fine for NIH-funded work; institutional repositories are fine for the rest.
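The citation-meta-tag check in the list above can be done by hand with view-source, or mechanically. A minimal sketch using only the standard library — the example HTML and its values are invented placeholders, not a real article page:

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect Highwire Press citation_* meta tags from an article page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = d.get("name", "")
            if name.startswith("citation_"):
                self.meta[name] = d.get("content", "")

def citation_tags(html: str) -> dict:
    p = CitationMetaParser()
    p.feed(html)
    return p.meta

page = ('<head><meta name="citation_title" content="TF X controls Y">'
        '<meta name="citation_author" content="Doe, Jane"></head>')
print(citation_tags(page))  # {'citation_title': 'TF X controls Y', 'citation_author': 'Doe, Jane'}
```

An empty result on your preprint's landing page is the signal to email the repository team.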
None of this requires a consultant, a budget, or a special tool. It is the same checklist we publish on the Findability Is Now a Funding Requirement post, and we keep repeating it because it is where most of the available improvement lives. The paid audit we sell is useful for diagnosing specific cases — a paper that is not showing up in Scholar, a profile that has fragmented across name variants, a lab-wide ORCID audit — but the bulk of the problem is fixable without one.
What this measurement does not prove
Seven limitations, stated before they can be pointed out by someone else.
- The central AI-citation claim is hypothesised, not observed. We have not yet run an end-to-end experiment in which real AI systems (ChatGPT, Claude, Perplexity, Google AI Overviews) are queried about papers from reachable versus blocked journals and the cited sources are recorded. That validation is phase 2 of this project. Until it runs, the claim that "your paper is cited less by AI systems because its platform blocks crawlers" is a mechanistic prediction from first-order technical observations (robots.txt + server response), not an empirical finding about AI citation behaviour.
- A real browser gets through the server blocks. A full browser with JavaScript, cookies, and a residential-IP TLS fingerprint would successfully load most of the 26 server-blocked journals. That is a real and legitimate measurement, but it answers a different question — what a human with a browser can see — not what a compliant AI crawler can retrieve. Since AI systems are the population we care about and AI crawlers do not have residential IPs, we chose the compliant-crawler measurement.
- n = 50 is a convenience sample. The 50 journals are the top-50 biomedical journals by h-index in OpenAlex, filtered to biomedical fields. They are not a random sample of scholarly publishing. The 90% effective-block rate is a property of this specific top-of-the-list set, and it is driven by a small number of publishers. Different journal sets would give different numbers. A broader survey is future work.
- Multi-sample aggregation, not n = 1. Unlike an earlier draft of this post, the current measurement samples up to three recent articles per journal (eight candidates per journal, keep the first three that pass domain-and-stub filters) and takes the per-check median for continuous scores and majority vote for binary checks. 21 of the 24 scored journals have k = 3. Two journals had only one valid candidate in the first eight; one had zero. The k = 3 sampling materially reduces the risk that a single weird template drives a score.
- Google Scholar is whitelisted on most blocked platforms. Scholar has direct agreements with most major publishers that let its crawler through. If your primary concern is classic Google Scholar visibility rather than AI retrieval, the crawler-block finding matters less — Scholar still works fine for NEJM.
- Private AI licensing deals exist and are invisible from outside. Some publishers have cut licensing agreements with specific AI companies. Under those agreements, the AI system gets content through a private API rather than by crawling. We cannot see these deals from the outside, and they are not reflected in robots.txt or server response. A paper in a "blocked" journal may still be retrievable by the specific AI systems whose publisher cut a deal with it.
- Single timepoint. Publishers' anti-bot postures change. A follow-up measurement in six months could look different — in either direction. The entire dataset will be re-run every six months for exactly this reason, with results versioned.
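The aggregation rule described in limitation four — per-check median for continuous scores, majority vote for binary checks — can be sketched in a few lines. The sample data and key names here are invented for illustration; ties in the binary vote are resolved toward False, which is one reasonable convention among several.

```python
from collections import Counter
from statistics import median

def aggregate(samples: list[dict]) -> dict:
    """Per-journal aggregation over k sampled articles: median for
    continuous checks, majority vote for binary checks."""
    out = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if all(isinstance(v, bool) for v in values):
            votes = Counter(values)
            out[key] = votes[True] > votes[False]  # ties resolve to False
        else:
            out[key] = median(values)
    return out

samples = [{"meta_score": 0.9, "has_llms_txt": True},
           {"meta_score": 0.5, "has_llms_txt": True},
           {"meta_score": 0.7, "has_llms_txt": False}]
print(aggregate(samples))  # {'meta_score': 0.7, 'has_llms_txt': True}
```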
Methodology, data, and validation
The full 27-check framework, category weights, the build script, and the raw per-journal JSON output are at academicseo.co.uk/journal-seo-rankings/. Every check has an auditable evidence trail: the exact sample DOIs used, the API responses, the robots.txt bodies parsed, the sample-article HTML attributes. Disputes about an individual journal's score can be resolved by reproducing the specific check that contradicts the database entry. Disputes about the framework weights themselves are more interesting and welcome by email.
Phase 1 is what you are reading — the infrastructure measurement. Phase 2 is the validation step: take a stratified sample of recent articles from each of the 50 journals, submit each title as a Google Scholar query, correlate the SEO composite with the SERP rank at which the article appears, and publish the result in a separate VALIDATION.md document regardless of which way it comes out. If the correlation between composite and SERP rank is below 0.4, the framework weights are wrong and the methodology is updated to say so. If above 0.6, the methodology ships as-is. Phase 3 is the AI-citation experiment: query ChatGPT, Claude, and Perplexity about specific findings from papers across the reachable and blocked tiers, record which URLs the systems cite, and measure whether blocked-tier papers are actually cited less often or less accurately. That is the experiment that will turn the mechanism in this post from "hypothesised" to "observed."
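The phase-2 correlation check is a Spearman rank correlation between composite score and SERP position. A self-contained sketch using the classic no-ties formula — a simplification; the real analysis would use a tie-aware library, and the (composite, rank) pairs below are invented for illustration:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation via rho = 1 - 6*sum(d^2)/(n(n^2-1)).
    Assumes no ties within either list."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical (composite, SERP-rank) pairs. SERP rank 1 is best, so negate
# ranks before correlating: higher composite should track better (lower) rank.
composites = [80.6, 73.5, 71.0, 68.2, 59.2]
serp_ranks = [1, 2, 4, 3, 9]
rho = spearman(composites, [-r for r in serp_ranks])
print(round(rho, 2))  # 0.9
```

Under the published decision rule, a rho like this would clear the 0.6 threshold and the methodology would ship as-is; below 0.4, the weights get revised.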
None of this is a finished product. It is a first-pass measurement, a published methodology, and a commitment to validate both — with the limitations stated upfront rather than buried. The reason to publish it now is that the dual-mechanism finding — server-level blocks and robots.txt disallow, together shutting out 90% of the top biomedical journals from compliant AI retrieval — is strong enough that it changes how PIs should think about the discoverability of their work. Waiting for phase 2 and phase 3 to ship before saying anything would delay the one action item that would have mattered this week: put your preprint up, link your ORCID, and make sure a reachable copy of your paper exists somewhere.
Frequently asked questions
Which top biomedical journals block AI crawlers at the platform level?
In our April 2026 measurement, 26 of the top 50 biomedical journals by h-index — including NEJM, JAMA, Science, PNAS, and Circulation — returned HTTP 403 or an equivalent block to a polite research user-agent. A further 19, including every Nature Portfolio title and every Cell Press title, let our crawler through but disallow GPTBot, ClaudeBot, PerplexityBot, and Google-Extended in robots.txt. Only five journals were fully open to a compliant AI crawler: PLoS ONE, The EMBO Journal, the Journal of Clinical Investigation, NeuroImage, and Chemical Society Reviews.
Is a publisher blocking AI crawlers the same as declaring it in robots.txt?
No. A publisher can declare in robots.txt that GPTBot, ClaudeBot, and PerplexityBot are allowed and still return HTTP 403 at the server level. We observed this gap on multiple journals, including NEJM. Robots.txt is a declared intent; the server response is the operational reality. AI systems cite what they can actually retrieve.
Why can Google Scholar index these journals if AI crawlers cannot?
Google Scholar's crawler is whitelisted by most major publisher platforms through direct agreements, typically based on IP range or TLS fingerprint. GPTBot, ClaudeBot, and PerplexityBot are newer, do not have these agreements, and are not whitelisted. The result is that a paper can be perfectly findable in Google Scholar and effectively invisible to ChatGPT or Perplexity.
Does this mean I should publish in PLoS ONE instead of NEJM?
No. Citation impact and AI discoverability are different things, and for most careers the citation impact matters more. What this measurement does suggest is that a paper in a crawler-blocked journal needs to work harder on its other discoverability surfaces: a findable preprint, a complete ORCID profile, a well-structured abstract that survives propagation to PubMed Central and review sites, and — where possible — green open access deposit.
What is a "polite research crawler" and why does the distinction matter?
A polite research crawler identifies itself honestly with a contact email, obeys robots.txt, rate-limits to one request per second per domain, and does not attempt to bypass authentication. Our measurement was conducted under these conditions, which approximates how a new AI crawler or a scholarly metadata aggregator would behave. A full browser would get through more publisher platforms, but that would measure what a human can see, not what AI crawlers can retrieve.
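The rate-limiting discipline in that definition is simple to implement: track the last request time per domain and sleep until the minimum interval has elapsed. A minimal sketch (the class name is hypothetical; the interval is shortened here only so the demo runs quickly — a real crawler would use 1.0 second):

```python
import time

class PoliteLimiter:
    """Enforce a minimum interval between requests to the same domain."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last: dict[str, float] = {}

    def wait(self, domain: str) -> None:
        now = time.monotonic()
        earliest = self.last.get(domain, 0.0) + self.min_interval
        if now < earliest:
            time.sleep(earliest - now)   # back off until the interval passes
        self.last[domain] = time.monotonic()

limiter = PoliteLimiter(min_interval=0.05)  # shortened for the demo
start = time.monotonic()
for _ in range(3):
    limiter.wait("journals.example.org")
elapsed = time.monotonic() - start
print(elapsed >= 0.10)  # True: at least two full intervals across three requests
```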
Want to know if a specific paper is retrievable by AI crawlers?
Our 115-point audit includes a platform-level access check — the same test we ran in this study, but applied to your specific paper's landing page, with a report on which AI crawlers can reach it and what structured data they can extract.
Submit your paper →