A postdoc in your lab mentions at Wednesday meeting that she asked ChatGPT about a finding she thinks your group reported first. It answered confidently, and it cited a 2025 review in Nature Reviews. It did not cite your paper. She shrugs — that is just how AI works — and you half-agree, because you have more pressing things to think about than how a chatbot attributes credit.
This post is about what actually happens when a trainee asks an AI system about your work, and why the answer depends less on the quality of your paper than on which publisher hosts it. In a reviewer-grade measurement we ran this week on the top 50 biomedical journals by h-index, we found that only five of those journals are in a state where a well-behaved AI crawler could actually read their articles. The other 45 are blocked — by two distinct mechanisms we can document from public data, neither of which most PIs know about.
Of the top 50 biomedical journals by h-index, measured April 2026:

- 5 are fully open to a compliant AI crawler (PLoS ONE, The EMBO Journal, Journal of Clinical Investigation, NeuroImage, Chemical Society Reviews).
- 26 block at the server level, returning HTTP 403 to our research crawler regardless of what their robots.txt declares.
- 19 more allow our polite crawler through the server but name GPTBot, ClaudeBot, PerplexityBot, and Google-Extended as disallowed in robots.txt — including every Nature Portfolio title and every Cell Press title. A compliant AI bot would fetch none of those 19 either.

Only 5 of 50 — 10% — are reachable for AI retrieval by a well-behaved crawler today.
What actually happens to a paper in a blocked journal
The chain is worth walking through slowly, because the step that matters for your citation count is in the middle and easy to miss.
A postdoc asks ChatGPT: "What's the latest on transcription factor X?" ChatGPT's retrieval layer searches the web, finds candidate sources — including a paper your group published last year — and tries to fetch each candidate's canonical page. For each reachable page, ChatGPT extracts the title, authors, abstract, and key claims, then decides which sources to cite in the answer and in what order.
For a paper on a blocked publisher platform, that fetch fails. Either the server refuses outright with HTTP 403 (one of our 26 server-blocked journals) or the journal's robots.txt tells the compliant bot not to fetch (one of our 19 robots-blocked journals). ChatGPT cannot extract the abstract or the citation metadata. So it picks the next-most-retrievable source that covers the same result: a 2025 Nature Reviews article that cites your paper, because Nature's article platform returns 200 and contains readable content. The review gets the citation. Your paper is mentioned inside the review's reference list — which ChatGPT did not read, because its retrieval layer stopped at the page it could parse.
Your name is still in the chain. The authority signal is not. A reader asking the system about your work is handed the review, not your paper.
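The two gates in that chain — the robots.txt declaration and the server's actual response — can be sketched as a single predicate. This is an illustrative sketch of the compliant-crawler decision, not any AI system's actual retrieval code; the function name and inputs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def compliant_can_read(robots_txt: str, url: str, bot_ua: str, server_status: int) -> bool:
    """A compliant AI crawler reads a page only if BOTH gates pass:
    1. robots.txt does not disallow its user agent for that URL, and
    2. the server actually serves the page (HTTP 200)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(bot_ua, url):
        return False              # mechanism two: declared disallow
    return server_status == 200   # mechanism one: server-level block

# A robots.txt that names GPTBot as disallowed (mechanism two):
robots = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /"
print(compliant_can_read(robots, "https://example.org/article/1", "GPTBot", 200))   # False
print(compliant_can_read(robots, "https://example.org/article/1", "OtherBot", 403)) # False
print(compliant_can_read(robots, "https://example.org/article/1", "OtherBot", 200)) # True
```

Either gate failing is enough: a paper behind a 403 and a paper behind a robots disallow are equally invisible to a bot that plays by the rules.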
The two blocking mechanisms, from the data
Mechanism one — server-level blocks (26/50 journals)
Twenty-six journals in our set returned HTTP 403 or the equivalent to our research crawler (UA string: AcademicSEO-Research/1.0 (+mailto:[email protected])). We attempted up to eight different recent DOIs per journal, so this is not a one-article fluke — every attempt returned 403.
To rule out the trivial explanation that publishers just block our specific UA string, we ran a diagnostic on nine representative blocked hosts with three additional user agents — a generic Firefox 128 string, a generic Chrome 124/macOS string, and our research UA. Every combination of host × UA returned HTTP 403. The block is therefore not at the UA-string layer. It is implemented at a deeper level (TLS fingerprint, JA3/JA4 hash, or header heuristics), which means it cannot be defeated by changing the user-agent string. Any AI crawler today that is not on the publisher's whitelist faces the same 403.
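The inference in that diagnostic can be stated as code: if any host served one UA and refused another, the block would be at the UA layer; uniform 403s across all UAs mean it is not. A sketch with a hypothetical function name — the probe data shape is ours, the logic is the argument above.

```python
def block_is_ua_independent(probe: dict[tuple[str, str], int]) -> bool:
    """Given observed {(host, user_agent): http_status} results, conclude the
    block is UA-independent only if no host shows a mixed response pattern
    and every request was refused."""
    hosts = {h for (h, _) in probe}
    for host in hosts:
        statuses = {s for (h, _), s in probe.items() if h == host}
        # A UA-level block would show a mix: some UAs 200, some 403.
        if len(statuses) > 1:
            return False
    return all(s == 403 for s in probe.values())

# Three UAs x two hosts, all 403 — the pattern we observed on blocked platforms:
probe = {(h, ua): 403
         for h in ("host-a.example", "host-b.example")
         for ua in ("Firefox/128", "Chrome/124", "AcademicSEO-Research/1.0")}
print(block_is_ua_independent(probe))  # True
```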
The server-blocked set, by publisher:
| Publisher | Blocked journals in our top-50 set | n |
|---|---|---|
| Lippincott Williams & Wilkins | Circulation, Journal of Clinical Oncology, Neurology | 3 |
| Oxford University Press | Nucleic Acids Research, Bioinformatics | 2 |
| Rockefeller University Press | The Journal of Cell Biology, The Journal of Experimental Medicine | 2 |
| Nature Portfolio | Nature Reviews Cancer, Nature Reviews Neuroscience | 2 |
| American Medical Association | JAMA, Archives of General Psychiatry | 2 |
| Other (one journal each) | NEJM (Massachusetts Medical Society), Science (AAAS), PNAS (NAS), Pediatrics (AAP), The Journal of Immunology (AAI), ACS Nano (ACS), Angewandte Chemie (Wiley), Annual Review of Biochemistry (Annual Reviews), American Journal of Psychiatry (APA), Psychological Review (APA), Physiological Reviews (APS), Annals of Internal Medicine (ACP), Journal of Neuroscience (SfN), Journal of Geophysical Research: Atmospheres (AGU), Blood (ASH, listed in OpenAlex as Elsevier BV) | 15 |
These journals cluster on a small number of publisher-platform stacks, but we are not going to name the specific middleware vendors because we cannot verify platform attribution at the HTTP level — our probe showed most of the hosts sit behind Cloudflare regardless of underlying platform, and the Server header does not distinguish them. Attributing "Silverchair" or "Atypon" by publisher name would be inference, not observation, and this post is being held to a reviewer standard.
Mechanism two — robots.txt disallow (19/50 additional journals)
Nineteen further journals passed our server-level fetch but declare at least one of GPTBot, ClaudeBot, PerplexityBot, or Google-Extended as disallowed in their robots.txt. Our polite crawler reached them because our UA string is not any of those named bots — but a compliant AI bot that self-identifies honestly and obeys robots.txt, which is what GPTBot, ClaudeBot, and PerplexityBot all do according to OpenAI, Anthropic, and Perplexity's published crawler documentation, would not fetch these pages. The net operational result for an AI system is identical to a 403.
This set includes, among others:
- All reachable Nature Portfolio titles — Nature, Nature Communications, Nature Medicine, Nature Genetics, Nature Biotechnology, Nature Neuroscience, Nature Immunology, and the Nature Reviews imprints we could reach. Every one of these declares GPTBot, ClaudeBot, PerplexityBot, and Google-Extended disallowed.
- All reachable Cell Press titles — Cell, Neuron, Immunity, Molecular Cell — via their ScienceDirect-hosted article pages. Elsevier's sciencedirect.com platform also declares the four main AI bots disallowed.
- Other Elsevier titles — Gastroenterology, The Lancet Oncology, the Journal of the American College of Cardiology, the Journal of Biological Chemistry.
- Genes & Development (Cold Spring Harbor).
We validated these robots.txt parses with two independent parsers: our own hand-rolled implementation, and the Protego library used by Scrapy (a spec-compliant RFC 9309 parser). On 160 bot × journal combinations the two parsers agreed in 100% of cases. The declared-blocked claim is therefore robust to parser error.
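Protego is a third-party library, but the cross-validation idea is parser-agnostic: run two independent implementations over the same robots.txt body and check they agree on every bot × path decision. A sketch with the stdlib parser standing in for one side and a deliberately simplified hand-rolled matcher (prefix-only, no wildcards or Allow lines — an illustration, not our production parser) on the other.

```python
from urllib.robotparser import RobotFileParser

def stdlib_allows(robots_txt: str, ua: str, path: str) -> bool:
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(ua, "https://example.org" + path)

def handrolled_allows(robots_txt: str, ua: str, path: str) -> bool:
    """Minimal matcher: find the group for this UA (falling back to '*')
    and test the path against its Disallow prefixes."""
    groups, current = {}, []
    for line in robots_txt.splitlines():
        line = line.split("#")[0].strip()
        if line.lower().startswith("user-agent:"):
            current = groups.setdefault(line.split(":", 1)[1].strip().lower(), [])
        elif line.lower().startswith("disallow:"):
            rule = line.split(":", 1)[1].strip()
            if rule:
                current.append(rule)
    rules = groups.get(ua.lower(), groups.get("*", []))
    return not any(path.startswith(r) for r in rules)

robots = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /private/"
agree = all(
    stdlib_allows(robots, bot, path) == handrolled_allows(robots, bot, path)
    for bot in ("GPTBot", "ClaudeBot", "PerplexityBot")
    for path in ("/article/1", "/private/data")
)
print(agree)  # True
```

Any disagreement between the two parsers flags a robots.txt body for manual inspection; on our 160 combinations there were none.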
The declared-vs-observed gap, across all 26 server-blocked journals
The most instructive finding — the one I want you to remember — is that 15 of the 26 server-blocked journals explicitly declare the main AI bots as allowed in their robots.txt, and still return 403 at the server. This is not one weird case; it is the majority of the server-blocked set.
Nucleic Acids Research (OUP), Bioinformatics (OUP), PEDIATRICS (AAP), Psychological Review (APA), Annual Review of Biochemistry, and The Journal of Immunology (AAI) all declare all seven AI bots we track (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, ChatGPT-User, CCBot, anthropic-ai) as allowed in robots.txt. Every one of those seven declared-allowed bots nonetheless receives HTTP 403 from the server when an article is actually requested.
PNAS, Annals of Internal Medicine, American Journal of Psychiatry, Neurology, ACS Nano, Angewandte Chemie — declare five to six AI bots allowed; all are blocked at the server anyway.
NEJM declares ClaudeBot and anthropic-ai as the two AI bots it permits. Both are 403'd at the server. NEJM's robots.txt does not mention our research UA at all, yet the server still returns 403 — meaning the block is whitelist-based (only specific approved agents are served), not blacklist-based (named bots are refused).
Parsed from each journal's /robots.txt using Protego, April 2026. Server responses were obtained from up to three valid sample DOIs per journal, drawn from a pool of eight recent candidates each.
Robots.txt is the thing a journal says it does. The server response is the thing it actually does. A PI reading OUP's robots.txt would reasonably conclude that Nucleic Acids Research welcomes GPTBot. The server disagrees, and the server is the thing AI systems actually encounter.
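Crossing declared intent against observed response gives exactly four states, and only one of them is readable by a compliant bot. A hypothetical classification function makes the cells explicit:

```python
def access_state(declared_allowed: bool, server_status: int) -> str:
    """Cross the declared robots.txt intent with the observed server
    response. Only one of the four cells is readable by a compliant AI bot."""
    served = server_status == 200
    if declared_allowed and served:
        return "open"                            # our 5 fully open journals
    if declared_allowed and not served:
        return "declared-open, server-blocked"   # 15 of the 26 (e.g. NAR)
    if not declared_allowed and served:
        return "robots-blocked"                  # the 19 (e.g. Nature, Cell)
    return "blocked at both layers"

print(access_state(True, 403))  # declared-open, server-blocked
```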
The five journals that are fully open
Five of the fifty top biomedical journals are simultaneously reachable by our polite crawler AND declare GPTBot, ClaudeBot, PerplexityBot, and Google-Extended all allowed in robots.txt. These are the journals a compliant AI system could actually read today without hitting either wall. In rank order on our composite discoverability framework:
| # | Journal | Publisher | Composite |
|---|---|---|---|
| 1 | PLoS ONE | Public Library of Science | 80.6 |
| 2 | The EMBO Journal | EMBO Press | 73.5 |
| 3 | Journal of Clinical Investigation | ASCI | 71.0 |
| 4 | NeuroImage | Elsevier (open access) | 68.2 |
| 5 | Chemical Society Reviews | Royal Society of Chemistry | 59.2 |
The pattern is unsurprising in retrospect: three gold-OA or diamond-OA journals (PLoS ONE, EMBO J, JCI), one Elsevier title that happens to be open access and inherits the ScienceDirect platform's AI-bot-allowed robots configuration (NeuroImage), and one selective RSC review journal. Four of the five are open-access. Open access is not the same as AI-reachable — the Nature Reviews imprints are partially open access and still blocked — but the fully-reachable list is dominated by OA titles, not subscription ones.
PLoS ONE is the only journal in the top 50 that clears 80 on our composite. This is not because PLoS ONE is a more prestigious journal than Nature — it is not. It is because PLoS ONE's platform is structurally oriented toward being read by machines: complete Highwire Press citation meta tags on every article, automatic PubMed Central deposition of the full text, all major AI crawlers explicitly allowed in a robots.txt the platform actually honours, and an llms.txt file at the root of the site describing how the content should be retrieved. It is what a journal designed today for AI-era discoverability would look like.
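For readers who have not met llms.txt: it is a proposed convention — a markdown file at the site root telling language-model crawlers where the readable content lives. The sketch below is a hypothetical illustration of the format for a generic journal, not a reproduction of PLoS ONE's actual file; all URLs and section names are invented.

```markdown
# Example Journal

> Open-access journal. Full text of every article is readable without
> authentication; machine-readable metadata is in Highwire citation meta tags.

## Article access

- [Article pages](https://journal.example.org/articles/): canonical HTML, one page per DOI
- [Full-text XML](https://journal.example.org/xml/): structured full text for every article
```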
"Should I publish in PLoS ONE instead, then?"
No. This is the reaction the post wants to pre-empt, because it is tempting and wrong.
Citation impact and AI discoverability are different dimensions. For most careers and most manuscripts, citation impact still matters more: your tenure committee, your next grant panel, and the PIs who read your papers have not rebuilt their attention around ChatGPT. If your lab's story is best served by a Cell paper, it is best served by a Cell paper. We are not suggesting otherwise.
What we are suggesting is that papers in blocked journals have to work harder on their other discoverability surfaces, because the canonical journal page cannot do the retrieval work any more. A Cell paper with a strong preprint, linked ORCIDs, a front-loaded abstract, and a PubMed Central copy is in roughly the same shape for AI retrieval as a PLoS ONE paper is by default. A Cell paper with none of those things loses in AI retrieval in a way that would not have been visible five years ago.
What a PI can do — and it really is an afternoon's work
The short list
- Put a preprint up before you submit. bioRxiv or medRxiv for biomedical work, your institutional repository as a fallback. Preprint servers are crawler-accessible by default and generate the citation meta tags that retrieval systems look for. If your journal's policy allows it, post the preprint before the submission clock starts.
- Check the preprint has citation meta tags. View the page source on your bioRxiv page and search for citation_title. If it is missing, the server team needs to know. This almost never happens on bioRxiv but we have seen it on institutional repositories and on a few preprint mirrors.
- Link every co-author's ORCID at submission, not after. Late linking does not propagate well through the retrieval graph. ORCID is the only persistent author identifier that survives spelling drift in your name, and it is the anchor every retrieval system uses to disambiguate authors.
- Front-load the abstract. Your first two sentences should state the finding with extractable specifics — the organism, the effect size, the mechanism, the population — not background context. Retrieval systems weight the opening of the abstract disproportionately, because that is where they look for a citable claim. Context and background go in sentence three onward.
- Use green OA if it is available. Deposit the accepted manuscript somewhere reachable. If your journal's platform is in the 45 blocked ones, the green OA copy may be the only version of your paper a compliant AI crawler can read. PubMed Central auto-deposit is fine for NIH-funded work; institutional repositories are fine for the rest.
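The citation-meta-tag check in the list above can be done by hand with view-source, or mechanically. A minimal sketch using only the standard library — the example HTML and its values are invented placeholders, not a real article page:

```python
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect Highwire Press citation_* meta tags from an article page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = d.get("name", "")
            if name.startswith("citation_"):
                self.meta[name] = d.get("content", "")

def citation_tags(html: str) -> dict:
    p = CitationMetaParser()
    p.feed(html)
    return p.meta

page = ('<head><meta name="citation_title" content="TF X controls Y">'
        '<meta name="citation_author" content="Doe, Jane"></head>')
print(citation_tags(page))  # {'citation_title': 'TF X controls Y', 'citation_author': 'Doe, Jane'}
```

An empty result on your preprint's landing page is the signal to email the repository team.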
None of this requires a consultant, a budget, or a special tool. It is the same checklist we publish on the Findability Is Now a Funding Requirement post, and we keep repeating it because it is where most of the available improvement lives. The paid audit we sell is useful for diagnosing specific cases — a paper that is not showing up in Scholar, a profile that has fragmented across name variants, a lab-wide ORCID audit — but the bulk of the problem is fixable without one.
What this measurement does not prove
Seven limitations, stated before they can be pointed out by someone else.
- The central AI-citation claim is hypothesised, not observed. We have not yet run an end-to-end experiment in which real AI systems (ChatGPT, Claude, Perplexity, Google AI Overviews) are queried about papers from reachable versus blocked journals and the cited sources are recorded. That validation is phase 2 of this project. Until it runs, the claim that "your paper is cited less by AI systems because its platform blocks crawlers" is a mechanistic prediction from first-order technical observations (robots.txt + server response), not an empirical finding about AI citation behaviour.
- A real browser gets through the server blocks. A full browser with JavaScript, cookies, and a residential-IP TLS fingerprint would successfully load most of the 26 server-blocked journals. That is a real and legitimate measurement, but it answers a different question — what a human with a browser can see — not what a compliant AI crawler can retrieve. Since AI systems are the population we care about and AI crawlers do not have residential IPs, we chose the compliant-crawler measurement.
- n = 50 is a convenience sample. The 50 journals are the top-50 biomedical journals by h-index in OpenAlex, filtered to biomedical fields. They are not a random sample of scholarly publishing. The 90% effective-block rate is a property of this specific top-of-the-list set, and it is driven by a small number of publishers. Different journal sets would give different numbers. A broader survey is future work.
- Multi-sample aggregation, not n = 1. Unlike an earlier draft of this post, the current measurement samples up to three recent articles per journal (eight candidates per journal, keep the first three that pass domain-and-stub filters) and takes the per-check median for continuous scores and majority vote for binary checks. 21 of the 24 scored journals have k = 3. Two journals had only one valid candidate in the first eight; one had zero. The k = 3 sampling materially reduces the risk that a single weird template drives a score.
- Google Scholar is whitelisted on most blocked platforms. Scholar has direct agreements with most major publishers that let its crawler through. If your primary concern is classic Google Scholar visibility rather than AI retrieval, the crawler-block finding matters less — Scholar still works fine for NEJM.
- Private AI licensing deals exist and are invisible from outside. Some publishers have cut licensing agreements with specific AI companies. Under those agreements, the AI system gets content through a private API rather than by crawling. We cannot see these deals from the outside, and they are not reflected in robots.txt or server response. A paper in a "blocked" journal may still be retrievable by the specific AI systems whose publisher cut a deal with it.
- Single timepoint. Publishers' anti-bot postures change. A follow-up measurement in six months could look different — in either direction. The entire dataset will be re-run every six months for exactly this reason, with results versioned.
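The aggregation rule described in limitation four — per-check median for continuous scores, majority vote for binary checks — can be sketched in a few lines. The sample data and key names here are invented for illustration; ties in the binary vote are resolved toward False, which is one reasonable convention among several.

```python
from collections import Counter
from statistics import median

def aggregate(samples: list[dict]) -> dict:
    """Per-journal aggregation over k sampled articles: median for
    continuous checks, majority vote for binary checks."""
    out = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if all(isinstance(v, bool) for v in values):
            votes = Counter(values)
            out[key] = votes[True] > votes[False]  # ties resolve to False
        else:
            out[key] = median(values)
    return out

samples = [{"meta_score": 0.9, "has_llms_txt": True},
           {"meta_score": 0.5, "has_llms_txt": True},
           {"meta_score": 0.7, "has_llms_txt": False}]
print(aggregate(samples))  # {'meta_score': 0.7, 'has_llms_txt': True}
```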
Methodology, data, and validation
The full 27-check framework, category weights, the build script, and the raw per-journal JSON output are at academicseo.co.uk/journal-seo-rankings/. Every check has an auditable evidence trail: the exact sample DOIs used, the API responses, the robots.txt bodies parsed, the sample-article HTML attributes. Disputes about an individual journal's score can be resolved by reproducing the specific check that contradicts the database entry. Disputes about the framework weights themselves are more interesting and welcome by email.
Phase 1 is what you are reading — the infrastructure measurement. Phase 2 is the validation step: take a stratified sample of recent articles from each of the 50 journals, submit each title as a Google Scholar query, correlate the SEO composite with the SERP rank at which the article appears, and publish the result in a separate VALIDATION.md document regardless of which way it comes out. If the correlation between composite and SERP rank is below 0.4, the framework weights are wrong and the methodology is updated to say so. If above 0.6, the methodology ships as-is. Phase 3 is the AI-citation experiment: query ChatGPT, Claude, and Perplexity about specific findings from papers across the reachable and blocked tiers, record which URLs the systems cite, and measure whether blocked-tier papers are actually cited less often or less accurately. That is the experiment that will turn the mechanism in this post from "hypothesised" to "observed."
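The phase-2 correlation check is a Spearman rank correlation between composite score and SERP position. A self-contained sketch using the classic no-ties formula — a simplification; the real analysis would use a tie-aware library, and the (composite, rank) pairs below are invented for illustration:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation via rho = 1 - 6*sum(d^2)/(n(n^2-1)).
    Assumes no ties within either list."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical (composite, SERP-rank) pairs. SERP rank 1 is best, so negate
# ranks before correlating: higher composite should track better (lower) rank.
composites = [80.6, 73.5, 71.0, 68.2, 59.2]
serp_ranks = [1, 2, 4, 3, 9]
rho = spearman(composites, [-r for r in serp_ranks])
print(round(rho, 2))  # 0.9
```

Under the published decision rule, a rho like this would clear the 0.6 threshold and the methodology would ship as-is; below 0.4, the weights get revised.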
None of this is a finished product. It is a first-pass measurement, a published methodology, and a commitment to validate both — with the limitations stated upfront rather than buried. The reason to publish it now is that the dual-mechanism finding — server-level blocks and robots.txt disallow, together shutting out 90% of the top biomedical journals from compliant AI retrieval — is strong enough that it changes how PIs should think about the discoverability of their work. Waiting for phase 2 and phase 3 to ship before saying anything would delay the one action item that would have mattered this week: put your preprint up, link your ORCID, and make sure a reachable copy of your paper exists somewhere.
Frequently asked questions
Which top biomedical journals block AI crawlers at the platform level?
In our April 2026 measurement, 26 of the top 50 biomedical journals by h-index — including NEJM, JAMA, Science, PNAS, and Circulation — returned HTTP 403 or an equivalent block to a polite research user-agent. A further 19, including every Nature Portfolio title and every Cell Press title, let our crawler through but disallow GPTBot, ClaudeBot, PerplexityBot, and Google-Extended in robots.txt. Only five journals were fully open to a compliant AI crawler: PLoS ONE, The EMBO Journal, the Journal of Clinical Investigation, NeuroImage, and Chemical Society Reviews.
Is a publisher blocking AI crawlers the same as declaring it in robots.txt?
No. A publisher can declare in robots.txt that GPTBot, ClaudeBot, and PerplexityBot are allowed and still return HTTP 403 at the server level. We observed this gap on multiple journals, including NEJM. Robots.txt is a declared intent; the server response is the operational reality. AI systems cite what they can actually retrieve.
Why can Google Scholar index these journals if AI crawlers cannot?
Google Scholar's crawler is whitelisted by most major publisher platforms through direct agreements, typically based on IP range or TLS fingerprint. GPTBot, ClaudeBot, and PerplexityBot are newer, do not have these agreements, and are not whitelisted. The result is that a paper can be perfectly findable in Google Scholar and effectively invisible to ChatGPT or Perplexity.
Does this mean I should publish in PLoS ONE instead of NEJM?
No. Citation impact and AI discoverability are different things, and for most careers the citation impact matters more. What this measurement does suggest is that a paper in a crawler-blocked journal needs to work harder on its other discoverability surfaces: a findable preprint, a complete ORCID profile, a well-structured abstract that survives propagation to PubMed Central and review sites, and — where possible — green open access deposit.
What is a "polite research crawler" and why does the distinction matter?
A polite research crawler identifies itself honestly with a contact email, obeys robots.txt, rate-limits to one request per second per domain, and does not attempt to bypass authentication. Our measurement was conducted under these conditions, which approximates how a new AI crawler or a scholarly metadata aggregator would behave. A full browser would get through more publisher platforms, but that would measure what a human can see, not what AI crawlers can retrieve.
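The rate-limiting discipline in that definition is simple to implement: track the last request time per domain and sleep until the minimum interval has elapsed. A minimal sketch (the class name is hypothetical; the interval is shortened here only so the demo runs quickly — a real crawler would use 1.0 second):

```python
import time

class PoliteLimiter:
    """Enforce a minimum interval between requests to the same domain."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last: dict[str, float] = {}

    def wait(self, domain: str) -> None:
        now = time.monotonic()
        earliest = self.last.get(domain, 0.0) + self.min_interval
        if now < earliest:
            time.sleep(earliest - now)   # back off until the interval passes
        self.last[domain] = time.monotonic()

limiter = PoliteLimiter(min_interval=0.05)  # shortened for the demo
start = time.monotonic()
for _ in range(3):
    limiter.wait("journals.example.org")
elapsed = time.monotonic() - start
print(elapsed >= 0.10)  # True: at least two full intervals across three requests
```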
Want to know if a specific paper is retrievable by AI crawlers?
Our 115-point audit includes a platform-level access check — the same test we ran in this study, but applied to your specific paper's landing page, with a report on which AI crawlers can reach it and what structured data they can extract.
Submit your paper →