AI Search & Discovery

We measured 50 top biomedical journals. Only 5 are open to AI crawlers.

14 April 2026 · 17 min read

A postdoc in your lab mentions at the Wednesday lab meeting that she asked ChatGPT about a finding she thinks your group reported first. It answered confidently, and it cited a 2025 review in Nature Reviews. It did not cite your paper. She shrugs — that is just how AI works — and you half-agree, because you have more pressing things to think about than how a chatbot attributes credit.

This post is about what actually happens when a trainee asks an AI system about your work, and why the answer depends less on the quality of your paper than on which publisher hosts it. In a reviewer-grade measurement we ran this week on the top 50 biomedical journals by h-index, we found that only five of those journals are in a state where a well-behaved AI crawler could actually read their articles. The other 45 are blocked — by two distinct mechanisms we can document from public data, neither of which most PIs know about.

The headline result

Of the top 50 biomedical journals by h-index, measured April 2026: 5 are fully open to a compliant AI crawler (PLoS ONE, The EMBO Journal, Journal of Clinical Investigation, NeuroImage, Chemical Society Reviews). 26 block at the server level, returning HTTP 403 to our research crawler regardless of what their robots.txt declares. 19 more allow our polite crawler through the server but name GPTBot, ClaudeBot, PerplexityBot, and Google-Extended as disallowed in robots.txt — including every Cell Press title and every Nature Portfolio title not already blocked at the server. A compliant AI bot would fetch none of those 19 either. Only 5 of 50 — 10% — are reachable for AI retrieval by a well-behaved crawler today.

[Figure: a 5 × 10 grid of 50 squares, one per journal in the top 50, colour-coded by accessibility status — 5 green (fully open to a compliant AI crawler), 19 amber (reachable by a polite crawler but robots.txt disallows AI bots), 26 red (server returns HTTP 403). Measured April 2026.]
Figure 1. Each square is one journal in the top 50 biomedical set, ranked by h-index in OpenAlex (April 2026). Green: reachable by our polite research crawler and declared open to GPTBot, ClaudeBot, PerplexityBot, and Google-Extended in robots.txt. Amber: reachable by our crawler but the publisher's robots.txt disallows at least one of those four AI bots by name — including all Cell Press titles and the Nature Portfolio titles not in the red set. Red: returned HTTP 403 to our crawler at the server level, regardless of what robots.txt declared. The amber and red blocks are independent mechanisms; together they account for 45 of 50 journals.

What actually happens to a paper in a blocked journal

The chain is worth walking through slowly, because the step that matters for your citation count is in the middle and easy to miss.

A postdoc asks ChatGPT: "What's the latest on transcription factor X?" ChatGPT's retrieval layer searches the web, finds candidate sources — including a paper your group published last year — and tries to fetch each candidate's canonical page. For each reachable page, ChatGPT extracts the title, authors, abstract, and key claims, then decides which sources to cite in the answer and in what order.

For a paper on a blocked publisher platform, that fetch fails. Either the server refuses outright with HTTP 403 (one of our 26 server-blocked journals) or the journal's robots.txt tells the compliant bot not to fetch (one of our 19 robots-blocked journals). ChatGPT cannot extract the abstract or the citation metadata. So it picks the next-most-retrievable source that covers the same result: a 2025 Nature Reviews article that cites your paper, read through a copy it can actually reach — a PubMed Central deposit, say — that returns 200 and contains readable content. The review gets the citation. Your paper is mentioned inside the review's reference list — which ChatGPT did not read, because its retrieval layer stopped at the page it could parse.

Your name is still in the chain. The authority signal is not. A reader asking the system about your work is handed the review, not your paper.

This is the mechanism we are measuring. The crawler block is not refusing citation. It is refusing retrieval. Citation in a retrieval-based AI system flows to whichever page was reachable and parseable — and for 45 of 50 top biomedical journals, the paper's canonical page is not that page. This mechanism is hypothesised from the technical observations below. We have not yet run an end-to-end experiment observing real AI systems citing papers under these conditions — that validation is phase 3 of the project and is flagged clearly at the bottom of this post.
[Figure: a four-step flow. 1. A reader asks the AI about a research finding. 2. The AI fetches the paper's canonical URL and is blocked (403). 3. The AI falls back to a reachable secondary source — a 2025 review, a press writeup, or a PMC copy. 4. The AI cites the reachable source; the original paper is referenced only through a third party.]
Figure 3. The retrieval-time consequence of a server-level or robots.txt block. A retrieval-grounded AI system cannot cite a page it cannot read, so it cites whatever page was reachable that covered the same finding — typically a review article, a press writeup, or a PubMed Central deposit. The PI's name remains in the citation graph downstream, but the primary attribution shifts to the secondary source. We have not yet run the end-to-end experiment that would observe this directly with real AI systems; phase 3 of the project is the validation, see Limitations §1.
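The fallback step can be sketched in a few lines. This is a hypothetical illustration of the mechanism, not any vendor's retrieval code: the URLs, the `fetch` interface, and both function names are invented for the example.

```python
# Hypothetical sketch of the retrieval-and-fallback step described above.
# Nothing here is a real AI system's API; it only illustrates the mechanism.

def pick_citable_source(candidate_urls, fetch):
    """Return the first candidate whose page is actually readable.

    `fetch` is assumed to return (status_code, html_text); a retrieval
    layer treats anything other than a readable 200 as unusable.
    """
    for url in candidate_urls:
        status, html = fetch(url)
        if status == 200 and html.strip():
            return url          # this page gets the citation
    return None                 # nothing reachable: no citation at all

def demo_fetch(url):
    """Stand-in fetcher: the canonical paper 403s, the review is readable."""
    pages = {
        "https://publisher.example/your-paper": (403, ""),
        "https://reachable.example/2025-review": (200, "<html>readable</html>"),
    }
    return pages[url]
```

Run against the two demo pages, the function skips the blocked canonical paper and hands the citation to the reachable review — which is the whole story of Figures 1 and 3 in one loop.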

The two blocking mechanisms, from the data

[Figure: two parallel flows, both ending at "no retrieval, no AI citation". Top — server-level block (26 of 50 journals): the publisher server checks TLS/IP/JA3 fingerprint and returns 403 Forbidden regardless of the bot's identity; 15 of the 26 declare AI bots allowed in robots.txt anyway. Bottom — robots.txt disallow (19 additional journals): a compliant bot reads the Disallow rule and never sends the request.]
Figure 2. Two independent blocking mechanisms produce the same outcome from the perspective of a compliant AI crawler. Top: the server refuses the request based on TLS fingerprint or IP whitelist, regardless of the bot's identity or what robots.txt says. We documented this on 26 of 50 journals; 15 of those 26 declare the major AI bots as allowed in robots.txt anyway, making the declared posture decorative. Bottom: the publisher's robots.txt names GPTBot, ClaudeBot, PerplexityBot, or Google-Extended as disallowed; a well-behaved bot reads the rule before fetching and never sends the request. We documented this on 19 additional journals — including every Cell Press title and every Nature Portfolio title not already blocked at the server.

Mechanism one — server-level blocks (26/50 journals)

Twenty-six journals in our set returned HTTP 403 or the equivalent to our research crawler (UA string: AcademicSEO-Research/1.0 (+mailto:[email protected])). We attempted up to eight different recent DOIs per journal, so this is not a one-article fluke — every attempt returned 403.

To rule out the trivial explanation that publishers just block our specific UA string, we ran a diagnostic on nine representative blocked hosts with three additional user agents — a generic Firefox 128 string, a generic Chrome 124/macOS string, and our research UA. Every combination of host × UA returned HTTP 403. The block is therefore not at the UA-string layer. It is implemented at a deeper level (TLS fingerprint, JA3/JA4 hash, or header heuristics), which means it cannot be defeated by changing the user-agent string. Any AI crawler today that is not on the publisher's whitelist faces the same 403.
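For readers who want to reproduce the diagnostic, a minimal sketch follows. The host is a placeholder and the Chrome UA string is abbreviated; `urllib` stands in for whatever HTTP client the real pipeline uses, and `diagnose` takes an injectable probe function so the classification logic can be exercised without network access.

```python
# Sketch of the UA-rotation diagnostic: same hosts, three user agents.
# If every host x UA pair returns 403, the block is below the UA layer.
import urllib.error
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ... Chrome/124.0",  # abbreviated
    "AcademicSEO-Research/1.0 (+mailto:[email protected])",
]

def probe(url, user_agent):
    """Fetch one URL with one UA and return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code          # a 403 lands here, not in the success branch

def diagnose(urls, probe_fn=probe):
    """Return (uniformly_blocked, per-pair results) across host x UA pairs."""
    results = {(u, ua): probe_fn(u, ua) for u in urls for ua in USER_AGENTS}
    return all(code == 403 for code in results.values()), results
```

A uniform wall of 403s across all pairs is what we observed on the nine representative hosts, which is what rules out a UA-string block.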

The server-blocked set, by publisher:

| Publisher | Blocked journals in our top-50 set | n |
|---|---|---|
| Lippincott Williams & Wilkins | Circulation, Journal of Clinical Oncology, Neurology | 3 |
| Oxford University Press | Nucleic Acids Research, Bioinformatics | 2 |
| Rockefeller University Press | The Journal of Cell Biology, The Journal of Experimental Medicine | 2 |
| Nature Portfolio | Nature Reviews Cancer, Nature Reviews Neuroscience | 2 |
| American Medical Association | JAMA, Archives of General Psychiatry | 2 |
| Other (one journal each) | NEJM (Massachusetts Medical Society), Science (AAAS), PNAS (NAS), Pediatrics (AAP), The Journal of Immunology (AAI), ACS Nano (ACS), Angewandte Chemie (Wiley), Annual Review of Biochemistry (Annual Reviews), American Journal of Psychiatry (APA), Psychological Review (APA), Physiological Reviews (APS), Annals of Internal Medicine (ACP), Journal of Neuroscience (SfN), Journal of Geophysical Research: Atmospheres (AGU), Blood (ASH; listed in OpenAlex as Elsevier BV) | 15 |

These journals cluster on a small number of publisher-platform stacks, but we are not going to name the specific middleware vendors because we cannot verify platform attribution at the HTTP level — our probe showed most of the hosts sit behind Cloudflare regardless of underlying platform, and the Server header does not distinguish them. Attributing "Silverchair" or "Atypon" by publisher name would be inference, not observation, and this post is being held to a reviewer standard.

Mechanism two — robots.txt disallow (19/50 additional journals)

Nineteen further journals passed our server-level fetch but declare at least one of GPTBot, ClaudeBot, PerplexityBot, or Google-Extended as disallowed in their robots.txt. Our polite crawler reached them because our UA string is not any of those named bots. A compliant AI bot that self-identifies honestly and obeys robots.txt — which is what GPTBot, ClaudeBot, and PerplexityBot all do, according to the crawler documentation published by OpenAI, Anthropic, and Perplexity — would not fetch these pages. The net operational result for an AI system is identical to a 403.

This set includes, among others, every Cell Press title and the Nature Portfolio titles that are not already blocked at the server.

We validated these robots.txt parses with two independent parsers: our own hand-rolled implementation, and the Protego library used by Scrapy (a spec-compliant RFC 9309 parser). On 160 bot × journal combinations the two parsers agreed in 100% of cases. The declared-blocked claim is therefore robust to parser error.
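The per-bot check is simple enough to sketch with Python's standard-library parser (the pipeline itself uses Protego; the semantics for plain Allow/Disallow records are the same). The robots.txt body below is illustrative, not any publisher's actual file.

```python
# Which of the tracked AI bots does a robots.txt forbid from an article URL?
import urllib.robotparser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

# Illustrative robots.txt: two AI bots shut out, everyone else allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

def disallowed_ai_bots(robots_txt, article_url):
    """Return the tracked AI bots that may not fetch article_url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, article_url)]
```

A compliant bot runs exactly this check before sending any request — which is why, for the 19 amber journals, the request is never sent at all.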

The declared-vs-observed gap, across all 26 server-blocked journals

The most instructive finding — the one I want you to remember — is that 15 of the 26 server-blocked journals explicitly declare the main AI bots as allowed in their robots.txt, and still return 403 at the server. This is not one weird case; it is the majority of the server-blocked set.

The declared-vs-observed gap — selected cases

Nucleic Acids Research (OUP), Bioinformatics (OUP), PEDIATRICS (AAP), Psychological Review (APA), Annual Review of Biochemistry, The Journal of Immunology (AAI) — all declare all seven AI bots we track (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, ChatGPT-User, CCBot, anthropic-ai) as allowed in robots.txt. All seven receive HTTP 403 from the server when we actually request an article.

PNAS, Annals of Internal Medicine, American Journal of Psychiatry, Neurology, ACS Nano, Angewandte Chemie — declare five to six AI bots allowed; all are blocked at the server anyway.

NEJM declares ClaudeBot and anthropic-ai as the two AI bots it permits. Both are 403'd at the server. NEJM's robots.txt does not mention our research UA at all, yet the server still returns 403 — meaning the block is whitelist-based (only specific approved agents are served), not blacklist-based (named bots are refused).

Parsed from each journal's /robots.txt using Protego, April 2026. Server responses were obtained for up to three recent DOIs per journal, drawn from a pool of eight candidates per journal.

Robots.txt is the thing a journal says it does. The server response is the thing it actually does. A PI reading OUP's robots.txt would reasonably conclude that Nucleic Acids Research welcomes GPTBot. The server disagrees, and the server is the thing AI systems actually encounter.
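Putting the two signals together, the green/amber/red classification behind Figure 1 reduces to a few lines. A sketch, with the server check taking precedence as described:

```python
# Three-way classification of one journal, server response checked first.

def classify(server_status, robots_disallows_ai_bot):
    """Classify one journal the way the post's figures do.

    server_status: HTTP status our polite crawler received for an article.
    robots_disallows_ai_bot: True if robots.txt names any tracked AI bot
    as disallowed.
    """
    if server_status == 403:
        return "red"      # server-level block; the declared posture is moot
    if robots_disallows_ai_bot:
        return "amber"    # reachable, but a compliant AI bot must not fetch
    return "green"        # reachable and declared open
```

The declared-vs-observed gap is the `red` branch firing on journals whose robots.txt would otherwise put them in `green` — 15 of the 26 server-blocked titles.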

The five journals that are fully open

Five of the fifty top biomedical journals are simultaneously reachable by our polite crawler AND declare GPTBot, ClaudeBot, PerplexityBot, and Google-Extended all allowed in robots.txt. These are the journals a compliant AI system could actually read today without hitting either wall. In rank order on our composite discoverability framework:

| # | Journal | Publisher | Composite |
|---|---|---|---|
| 1 | PLoS ONE | Public Library of Science | 80.6 |
| 2 | The EMBO Journal | EMBO Press | 73.5 |
| 3 | Journal of Clinical Investigation | ASCI | 71.0 |
| 4 | NeuroImage | Elsevier (open access) | 68.2 |
| 5 | Chemical Society Reviews | Royal Society of Chemistry | 59.2 |

The pattern is unsurprising in retrospect: three gold-OA or diamond-OA journals (PLoS ONE, EMBO J, JCI), one Elsevier title that happens to be open access and inherits the ScienceDirect platform's AI-bot-allowed robots configuration (NeuroImage), and one selective RSC review journal. Four of the five are open-access. Open access is not the same as AI-reachable — the Nature Reviews imprints are partially open access and still blocked — but the fully-reachable list is dominated by OA titles, not subscription ones.

PLoS ONE is the only journal in the top 50 that clears 80 on our composite. This is not because PLoS ONE is a more prestigious journal than Nature — it is not. It is because PLoS ONE's platform is structurally oriented toward being read by machines: complete Highwire Press citation meta tags on every article, automatic PubMed Central deposition of the full text, all major AI crawlers explicitly allowed in a robots.txt the platform actually honours, and an llms.txt file at the root of the site describing how the content should be retrieved. It is what a journal designed today for AI-era discoverability would look like.
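To make "complete citation meta tags" concrete: a machine reader can lift the bibliographic record straight out of the page head. A stdlib sketch, with an invented HTML fragment (title, author, and DOI are all placeholders) standing in for a real article page:

```python
# Extract Highwire-style citation_* meta tags from an article page head.
from html.parser import HTMLParser

# Illustrative fragment, not a real PLoS ONE page.
SAMPLE_HEAD = """
<head>
<meta name="citation_title" content="A transcription factor X finding">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_doi" content="10.1371/journal.pone.0000000">
</head>
"""

class CitationMetaParser(HTMLParser):
    """Collect every <meta name="citation_*" content="..."> pair."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = attr.get("name", "")
        if tag == "meta" and name.startswith("citation_"):
            self.meta[name] = attr.get("content", "")

parser = CitationMetaParser()
parser.feed(SAMPLE_HEAD)
```

This is the record a crawler extracts in milliseconds when the tags are present — and cannot extract at all when the page itself returns 403.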

"Should I publish in PLoS ONE instead, then?"

No. This is the reaction the post wants to pre-empt, because it is tempting and wrong.

Citation impact and AI discoverability are different dimensions. For most careers and most manuscripts, citation impact still matters more: your tenure committee, your next grant panel, and the PIs who read your papers have not rebuilt their attention around ChatGPT. If your lab's story is best served by a Cell paper, it is best served by a Cell paper. We are not suggesting otherwise.

What we are suggesting is that papers in blocked journals have to work harder on their other discoverability surfaces, because the canonical journal page cannot do the retrieval work any more. A Cell paper with a strong preprint, linked ORCIDs, a front-loaded abstract, and a PubMed Central copy is in roughly the same shape for AI retrieval as a PLoS ONE paper is by default. A Cell paper with none of those things loses in AI retrieval in a way that would not have been visible five years ago.

What a PI can do — and it really is an afternoon's work

The short list

None of this requires a consultant, a budget, or a special tool:

  1. Put a preprint up, so a reachable full-text copy of every paper exists somewhere.
  2. Link your ORCID and keep the profile complete, so your papers resolve to one identity rather than fragmenting across name variants.
  3. Front-load the abstract, so the key claim survives propagation to PubMed Central and review sites.
  4. Deposit a green open access copy, ideally in PubMed Central, wherever the journal permits it.

It is the same checklist we publish on the Findability Is Now a Funding Requirement post, and we keep repeating it because it is where most of the available improvement lives. The paid audit we sell is useful for diagnosing specific cases — a paper that is not showing up in Scholar, a profile that has fragmented across name variants, a lab-wide ORCID audit — but the bulk of the problem is fixable without one.

What this measurement does not prove

Seven limitations, stated before they can be pointed out by someone else.

  1. The central AI-citation claim is hypothesised, not observed. We have not yet run an end-to-end experiment in which real AI systems (ChatGPT, Claude, Perplexity, Google AI Overviews) are queried about papers from reachable versus blocked journals and the cited sources are recorded. That validation is phase 3 of this project. Until it runs, the claim that "your paper is cited less by AI systems because its platform blocks crawlers" is a mechanistic prediction from first-order technical observations (robots.txt + server response), not an empirical finding about AI citation behaviour.
  2. A real browser gets through the server blocks. A full browser with JavaScript, cookies, and a residential-IP TLS fingerprint would successfully load most of the 26 server-blocked journals. That is a real and legitimate measurement, but it answers a different question — what a human with a browser can see — not what a compliant AI crawler can retrieve. Since AI systems are the population we care about and AI crawlers do not have residential IPs, we chose the compliant-crawler measurement.
  3. n = 50 is a convenience sample. The 50 journals are the top-50 biomedical journals by h-index in OpenAlex, filtered to biomedical fields. They are not a random sample of scholarly publishing. The 90% effective-block rate is a property of this specific top-of-the-list set, and it is driven by a small number of publishers. Different journal sets would give different numbers. A broader survey is future work.
  4. Multi-sample aggregation, not n = 1. Unlike an earlier draft of this post, the current measurement samples up to three recent articles per journal (eight candidates per journal, keep the first three that pass domain-and-stub filters) and takes the per-check median for continuous scores and majority vote for binary checks. 21 of the 24 scored journals have k = 3. Two journals had only one valid candidate in the first eight; one had zero. The k = 3 sampling materially reduces the risk that a single weird template drives a score.
  5. Google Scholar is whitelisted on most blocked platforms. Scholar has direct agreements with most major publishers that let its crawler through. If your primary concern is classic Google Scholar visibility rather than AI retrieval, the crawler-block finding matters less — Scholar still works fine for NEJM.
  6. Private AI licensing deals exist and are invisible from outside. Some publishers have cut licensing agreements with specific AI companies. Under those agreements, the AI system gets content through a private API rather than by crawling. We cannot see these deals from the outside, and they are not reflected in robots.txt or server response. A paper in a "blocked" journal may still be retrievable by the specific AI systems whose publisher cut a deal with it.
  7. Single timepoint. Publishers' anti-bot postures change. A follow-up measurement in six months could look different — in either direction. The entire dataset will be re-run every six months for exactly this reason, with results versioned.
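The aggregation described in limitation 4 is, mechanically, just this (the field names are illustrative; the real framework runs 27 checks per journal):

```python
# Per-journal aggregation: median for continuous scores, majority vote
# for binary checks, across up to k = 3 sampled articles.
from statistics import median

def aggregate(samples):
    """samples: list of per-article dicts, one per sampled article (k <= 3)."""
    scores = [s["score"] for s in samples]
    flags = [s["has_meta_tags"] for s in samples]
    return {
        "score": median(scores),                       # per-check median
        "has_meta_tags": sum(flags) > len(flags) / 2,  # majority vote
    }
```

With three samples, one article built from a weird template can shift a median by at most one rank position and cannot flip a majority vote on its own — which is the point of the change from the earlier n = 1 draft.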

Methodology, data, and validation

The full 27-check framework, category weights, the build script, and the raw per-journal JSON output are at academicseo.co.uk/journal-seo-rankings/. Every check has an auditable evidence trail: the exact sample DOIs used, the API responses, the robots.txt bodies parsed, the sample-article HTML attributes. Disputes about an individual journal's score can be resolved by reproducing the specific check that contradicts the database entry. Disputes about the framework weights themselves are more interesting and welcome by email.

Phase 1 is what you are reading — the infrastructure measurement. Phase 2 is the validation step: take a stratified sample of recent articles from each of the 50 journals, submit each title as a Google Scholar query, correlate the SEO composite with the SERP rank at which the article appears, and publish the result in a separate VALIDATION.md document regardless of which way it comes out. If the correlation between composite and SERP rank is below 0.4, the framework weights are wrong and the methodology is updated to say so. If above 0.6, the methodology ships as-is. Phase 3 is the AI-citation experiment: query ChatGPT, Claude, and Perplexity about specific findings from papers across the reachable and blocked tiers, record which URLs the systems cite, and measure whether blocked-tier papers are actually cited less often or less accurately. That is the experiment that will turn the mechanism in this post from "hypothesised" to "observed."
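The phase-2 acceptance test is easy to state in code. A stdlib sketch of Spearman rank correlation plus the stated thresholds; since a higher composite should predict a better (numerically smaller) SERP rank, the observed correlation would be negative, so the decision below uses its magnitude. The `verdict` wording is ours, not a quote from the methodology.

```python
# Phase-2 check: rank-correlate SEO composite with Google Scholar SERP rank.
from statistics import mean

def ranks(values):
    """Rank values from 1 upward, averaging tied groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend over a run of ties
        avg = (i + j) / 2 + 1            # average rank for the tied run
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def verdict(rho):
    """Apply the 0.4 / 0.6 thresholds from the methodology."""
    if abs(rho) < 0.4:
        return "weights are wrong: update methodology"
    if abs(rho) > 0.6:
        return "methodology ships as-is"
    return "inconclusive: investigate"
```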

None of this is a finished product. It is a first-pass measurement, a published methodology, and a commitment to validate both — with the limitations stated upfront rather than buried. The reason to publish it now is that the dual-mechanism finding — server-level blocks and robots.txt disallow, together shutting out 90% of the top biomedical journals from compliant AI retrieval — is strong enough that it changes how PIs should think about the discoverability of their work. Waiting for phase 2 and phase 3 to ship before saying anything would delay the one action item that would have mattered this week: put your preprint up, link your ORCID, and make sure a reachable copy of your paper exists somewhere.

Frequently asked questions

Which top biomedical journals block AI crawlers at the platform level?

In our April 2026 measurement, 26 of the top 50 biomedical journals by h-index returned HTTP 403 or an equivalent block to a polite research user-agent; the server-blocked set included NEJM, JAMA, Science, PNAS, Circulation, Nucleic Acids Research, and both Nature Reviews titles in our set. A further 19 journals — including every Cell Press title — pass the server check but disallow GPTBot, ClaudeBot, PerplexityBot, or Google-Extended in robots.txt. Only five journals were fully open to a compliant AI crawler: PLoS ONE, The EMBO Journal, Journal of Clinical Investigation, NeuroImage, and Chemical Society Reviews.

Is a publisher blocking AI crawlers the same as declaring it in robots.txt?

No. A publisher can declare in robots.txt that GPTBot, ClaudeBot, and PerplexityBot are allowed and still return HTTP 403 at the server level. We observed this gap on multiple journals, including NEJM. Robots.txt is a declared intent; the server response is the operational reality. AI systems cite what they can actually retrieve.

Why can Google Scholar index these journals if AI crawlers cannot?

Google Scholar's crawler is whitelisted by most major publisher platforms through direct agreements, typically based on IP range or TLS fingerprint. GPTBot, ClaudeBot, and PerplexityBot are newer, do not have these agreements, and are not whitelisted. The result is that a paper can be perfectly findable in Google Scholar and effectively invisible to ChatGPT or Perplexity.

Does this mean I should publish in PLoS ONE instead of NEJM?

No. Citation impact and AI discoverability are different things, and for most careers the citation impact matters more. What this measurement does suggest is that a paper in a crawler-blocked journal needs to work harder on its other discoverability surfaces: a findable preprint, a complete ORCID profile, a well-structured abstract that survives propagation to PubMed Central and review sites, and — where possible — green open access deposit.

What is a "polite research crawler" and why does the distinction matter?

A polite research crawler identifies itself honestly with a contact email, obeys robots.txt, rate-limits to one request per second per domain, and does not attempt to bypass authentication. Our measurement was conducted under these conditions, which approximates how a new AI crawler or a scholarly metadata aggregator would behave. A full browser would get through more publisher platforms, but that would measure what a human can see, not what AI crawlers can retrieve.
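That politeness contract can be captured in a small helper. A sketch: the UA string is the one quoted earlier in the post, `min_interval` enforces the one-request-per-second rule per domain, and the clock is injectable so the throttling logic is testable without real time or network I/O.

```python
# Sketch of a polite crawler's throttle: honest UA with contact address,
# and at most one request per second per domain.
import time
from urllib.parse import urlparse

class PoliteCrawler:
    UA = "AcademicSEO-Research/1.0 (+mailto:[email protected])"

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval   # seconds between hits per domain
        self.clock = clock
        self.last = {}                     # domain -> time of last request

    def wait_time(self, url):
        """Seconds to sleep before this URL may politely be fetched."""
        domain = urlparse(url).netloc
        if domain not in self.last:
            return 0.0
        elapsed = self.clock() - self.last[domain]
        return max(0.0, self.min_interval - elapsed)

    def note_request(self, url):
        """Record that a request to this URL's domain was just sent."""
        self.last[urlparse(url).netloc] = self.clock()
```

A real implementation would also fetch and obey robots.txt before every domain's first request; this fragment only shows the rate-limiting half of the contract.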

Want to know if a specific paper is retrievable by AI crawlers?

Our 115-point audit includes a platform-level access check — the same test we ran in this study, but applied to your specific paper's landing page, with a report on which AI crawlers can reach it and what structured data they can extract.

Submit your paper →