v3 artifacts

Every score in the database can be reproduced by re-running the build script. The raw JSON contains the full per-journal audit trail: every API response, every robots.txt parse, every sample article URL fetched, the fetch source tag for each (cache, urllib, curl, or manual Chrome harvest), the Pass-1 research-UA reachability flag, and every per-check raw value.

1. What this ranking measures, and what it does not

This ranking scores journals on discoverability infrastructure: how well a journal’s article HTML is technically set up for search engines (Google, Google Scholar) and AI-citation crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) after a paper is published.

It does not measure:

A journal can have a high Impact Factor and still score poorly here if its platform has weak technical SEO, and vice versa. The two are complementary, not substitutes. For an argument on why venue-level Impact Factor is a weak predictor of an individual paper’s real-world impact, see beyond impact factor.

2. Why v3 looks different from v2

v2 had 32 checks across 5 categories (A–E) and scored 8 scholarly aggregators alongside the journals. v3 dropped both:

The v2 server-reachability check (D7, “does the server respond at all to a research crawler”) is preserved as a separate raw_d7_reachable datum in the JSON for the crawler-blocking blog post and is exposed on the public ranking, but it is not part of the composite. v3 removes the conflation between “this journal blocks polite scholarly crawlers” (a reachability finding) and “this journal’s HTML is well-marked-up for retrieval” (a compliance finding). They are different claims and they get different surfaces.

3. Data sources

Every metric is derived from one of the following public, machine-readable sources. No scraping of gated content, no circumvention of robots.txt, no private APIs.

| Source | Used for |
| --- | --- |
| OpenAlex | Journal list, h-index, OA status, ISSN, homepage, sample article candidates |
| CrossRef | DOI deposit count per ISSN (registry presence floor) |
| DOAJ | OA journal membership |
| NLM E-utilities | MEDLINE indexing (NLM Catalog currentindexingstatus) |
| Publisher robots.txt | AI-bot allowances (GPTBot, ClaudeBot, PerplexityBot, Google-Extended), parsed by Protego (Scrapy’s RFC 9309 parser) |
| Publisher llms.txt | Emerging AI-friendly disclosure standard |
| Publisher sitemap.xml | /sitemap.xml directly or via Sitemap: directive in robots.txt |
| Sample article HTML | Every on-page check in §4. Fetched as Googlebot via a three-tier chain (see §3.2). |

3.1 Sample article selection

For each journal we request up to 8 candidate articles from OpenAlex (type:article, has_doi:true, has_abstract:true, sort=publication_date:desc), reject any whose resolved URL is on a domain unrelated to the journal homepage (a guard against OpenAlex attribution errors), reject any that resolves to a small meta-refresh shell with no markup, and keep the first 3 that pass. Per-check scores are aggregated by median (continuous) or majority vote (binary). Sampled DOIs and resolved URLs are recorded in the JSON for reproducibility.
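The aggregation rule above can be sketched as follows. This is a minimal illustration, not the build script's code: the function name `aggregate_check` and the tie rule for an even number of samples (a tie counts as a fail) are assumptions.

```python
from statistics import median


def aggregate_check(values, kind):
    """Collapse per-article values for one check into a journal-level value.

    `values` holds one entry per sampled article (up to 3).
    `kind` is "continuous" (median) or "binary" (majority vote).
    """
    if not values:
        return None  # no article HTML was retrieved for this journal
    if kind == "continuous":
        return median(values)
    if kind == "binary":
        # Majority vote; assumption: a tie (possible with 2 samples) fails.
        return 1 if sum(values) * 2 > len(values) else 0
    raise ValueError(f"unknown check kind: {kind!r}")


print(aggregate_check([0.8, 0.6, 1.0], "continuous"))
print(aggregate_check([1, 0, 1], "binary"))
```

With three samples the median is simply the middle value, so one outlier article cannot move a continuous score on its own.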

For Elsevier journals where doi.org resolves to linkinghub.elsevier.com, a 2–3 kB meta-refresh shell, the build rewrites the URL to https://www.sciencedirect.com/science/article/pii/{pii}, which serves the article HTML with Highwire Press meta tags. Verified empirically; documented in the build script.
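A hedged sketch of that rewrite is below. It assumes the linkinghub URL carries the PII as a path segment after `/pii/`; the function name `rewrite_linkinghub` and the regular expression are illustrative and are not taken from the build script.

```python
import re
from urllib.parse import urlparse

SCIENCEDIRECT_TEMPLATE = "https://www.sciencedirect.com/science/article/pii/{pii}"

# Assumption: the PII appears in the linkinghub path as ".../pii/<PII>".
_PII_IN_PATH = re.compile(r"/pii/([A-Za-z0-9]+)")


def rewrite_linkinghub(url):
    """Map a linkinghub.elsevier.com URL to its ScienceDirect article URL.

    Any URL that is not on linkinghub, or has no recognisable PII,
    is returned unchanged.
    """
    parsed = urlparse(url)
    if parsed.hostname != "linkinghub.elsevier.com":
        return url
    match = _PII_IN_PATH.search(parsed.path)
    if match is None:
        return url
    return SCIENCEDIRECT_TEMPLATE.format(pii=match.group(1))
```

Returning the input unchanged on any mismatch keeps the rewrite safe to apply to every resolved URL, not only Elsevier ones.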

3.2 Dual-UA fetch chain: disclosed, not hidden

A core finding of the v2 work was that 15+ top-50 publishers return HTTP 403 to a polite scholarly research crawler at the TLS / IP layer regardless of what their robots.txt declares. This created a structural problem for v2: those journals scored low on every on-page check not because their HTML was bad, but because we never saw their HTML.

v3 splits the fetch into two passes:

  1. Pass 1, research-UA probe. A single GET with User-Agent: AcademicSEO-Research/1.0 (+mailto:gkumar@academicseo.co.uk). The result (HTTP 200 vs. 403/blocked/timeout) is recorded as research_ua_reachable per article and aggregated as raw_d7_reachable per journal. This pass is not used for scoring. It exists so the legacy crawler-blocking finding remains reproducible and is exposed on the public ranking as the “Research UA %” column.
  2. Pass 2, Googlebot audit fetch, three-tier chain. The actual scoring fetch.
    • Tier A: HTML cache. Per-URL cache keyed by sha256("gbot|" + url)[:24]. If a previous run captured the article HTML, we re-use it. The cache lives in journal-seo-rankings/cache/.
    • Tier B: urllib with Googlebot UA. User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).
    • Tier C: curl with the same Googlebot UA, different TLS stack. Cloudflare and Akamai inspect the TLS ClientHello fingerprint (JA3) for bot detection; Python’s urllib and a system curl produce different fingerprints, and we observe in practice that some publishers reject one and accept the other.
    • Residual. Any URL that all three tiers fail to retrieve is captured manually from a real Chrome session (real TLS fingerprint, real cookies, real JavaScript execution) and added to the HTML cache under the same hash key. The next build picks it up automatically.

Each per-article score record carries a fetch_source tag (cache | urllib_gbot | curl_gbot | failed) so any score can be traced back to which tier produced its HTML.
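The Pass 2 chain and its fetch_source tag can be sketched as below. The cache key follows the formula given for Tier A; everything else is an assumption made for illustration: the hex digest and UTF-8 encoding, the in-memory dict standing in for the on-disk cache, and the `fetch_article` name and its injectable fetcher callables.

```python
import hashlib


def cache_key(url):
    """Tier A key: sha256("gbot|" + url), truncated to 24 characters.

    Assumption: the digest is hex-encoded and the URL is UTF-8 encoded.
    """
    return hashlib.sha256(("gbot|" + url).encode("utf-8")).hexdigest()[:24]


def fetch_article(url, cache, fetchers):
    """Return (html, fetch_source) for one article URL.

    `cache` is a dict standing in for the HTML cache directory.
    `fetchers` is an ordered list of (tag, callable) pairs, for example
    [("urllib_gbot", fetch_with_urllib), ("curl_gbot", fetch_with_curl)],
    where both callables are hypothetical and supplied by the caller.
    """
    key = cache_key(url)
    if key in cache:
        return cache[key], "cache"
    for tag, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            html = None  # treat any error (403, timeout, TLS reject) as a miss
        if html:
            cache[key] = html  # the next build re-uses this capture
            return html, tag
    return None, "failed"
```

A manually harvested page only needs to be stored under `cache_key(url)`; the next call then reports it with the `cache` tag, which matches the residual step described above.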

Why Googlebot is a legitimate proxy. The framework is asking “what does Google see on this page?” Spoofing the UA does not change the answer to that question; it asks the publisher to serve us the same HTML they serve Googlebot. Google publishes a Googlebot verification protocol that uses reverse-DNS, which we cannot satisfy from a non-Google IP, so any publisher that performs that check will still 403 us; in practice almost no biomedical publisher does the reverse-DNS step, so the spoof works for the audit purpose. We disclose the UA; we do not pretend. The build script header and this document both name the UA used.

4. The 3 metric categories and 17 checks

Category 1: On-page article SEO (60% weight)

Direct inspection of the article HTML. Ten checks; the category score is the unweighted mean.

| # | Check | Type |
| --- | --- | --- |
| p1 | Highwire Press citation_* meta tag completeness (citation_title, citation_author, citation_publication_date, citation_journal_title, citation_volume, citation_issue, citation_firstpage, citation_lastpage, citation_pdf_url, citation_doi) | continuous (N/10) |
| p2 | Schema.org Article or ScholarlyArticle JSON-LD with required fields (@type, headline, image, datePublished, author, publisher) | binary |
| p3 | OpenGraph completeness (og:title, og:description, og:image, og:type, og:url) | continuous (N/5) |
| p4 | Twitter Card completeness (twitter:card, twitter:title, twitter:description, twitter:image) | continuous (N/4) |
| p5 | Canonical tag present and exact-match to the resolved article URL | binary |
| p6 | <title> present, plausible length | binary |
| p7 | <meta name="description"> present, 70–200 characters | binary |
| p8 | Single <h1> present | binary |
| p9 | Semantic HTML mean (<article> / <section> / heading hierarchy + figure alt-text coverage) | continuous |
| p10 | Abstract present in initial HTML body and total visible text > 2,000 chars (rules out JS-only renders and stubs) | binary |

Category 1 score = mean of p1…p10. The composite weights this 60% because, conditional on a page being reachable at all, on-page markup is what determines whether Google Scholar parses the article correctly, whether structured data renders in SERP, and whether AI crawlers can extract a citation-worthy block.
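As an illustration of how a continuous on-page check could be computed, here is a minimal sketch of p1 using only the standard library. The ten tag names come from the table above; the parser class, the `p1_score` name, and the rule that a tag counts only when its content attribute is non-empty are assumptions, not the build script's implementation.

```python
from html.parser import HTMLParser

HIGHWIRE_TAGS = (
    "citation_title", "citation_author", "citation_publication_date",
    "citation_journal_title", "citation_volume", "citation_issue",
    "citation_firstpage", "citation_lastpage", "citation_pdf_url",
    "citation_doi",
)


class _MetaNames(HTMLParser):
    """Collect the name of every <meta> tag that has non-empty content."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        # Assumption: an empty content attribute does not count as present.
        if name and (attr.get("content") or "").strip():
            self.names.add(name)


def p1_score(html):
    """Fraction of the ten Highwire Press tags present, in [0, 1]."""
    parser = _MetaNames()
    parser.feed(html)
    return sum(tag in parser.names for tag in HIGHWIRE_TAGS) / len(HIGHWIRE_TAGS)
```

The same pattern (collect tags, count matches against a fixed list) would extend to the OpenGraph and Twitter Card checks, with the caveat that OpenGraph tags conventionally use a `property` attribute rather than `name`.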

Category 2: Registry presence (20% weight)

Whether the journal exists in the registries that scholarly retrieval and AI systems consult before they ever look at an article URL. Four binary checks.

| # | Check | Source |
| --- | --- | --- |
| r1 | CrossRef has > 100 deposited DOIs for this journal’s ISSN | CrossRef API |
| r2 | NLM Catalog currentindexingstatus = Y (MEDLINE indexed) | NLM E-utilities |
| r3 | DOAJ-listed (auto-pass for closed-access journals) | DOAJ API / OpenAlex is_oa |
| r4 | OpenAlex works_count > 1000 | OpenAlex |

Category 2 score = mean of r1…r4. Closed-access journals are not penalised on r3: DOAJ membership is conditional on being open-access, so it would be a category error to dock a paywalled journal for not appearing there. The auto-pass is documented and machine-checkable: r3 = 1 if not is_oa else doaj_listed.
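A small sketch of the four registry checks and the r3 auto-pass, using the thresholds from the table. The function and parameter names are illustrative assumptions; only the rules themselves come from this section.

```python
def registry_score(crossref_doi_count, medline_indexed, is_oa,
                   doaj_listed, openalex_works_count):
    """Category 2 score: unweighted mean of the four binary checks r1..r4."""
    r1 = int(crossref_doi_count > 100)
    r2 = int(medline_indexed)
    # Auto-pass: a closed-access journal cannot be in DOAJ by definition.
    r3 = 1 if not is_oa else int(doaj_listed)
    r4 = int(openalex_works_count > 1000)
    return (r1 + r2 + r3 + r4) / 4


# A paywalled journal that passes r1, r2 and r4 is not docked on r3.
print(registry_score(5000, True, is_oa=False, doaj_listed=False,
                     openalex_works_count=20000))
# An open-access journal missing from DOAJ loses r3.
print(registry_score(5000, True, is_oa=True, doaj_listed=False,
                     openalex_works_count=20000))
```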

Category 3: Crawler-friendly (20% weight)

Whether the journal’s robots posture and discoverability infrastructure invite scholarly + AI retrieval. Three checks.

| # | Check | Source |
| --- | --- | --- |
| c1 | Fraction of major AI bots allowed in robots.txt: GPTBot, ClaudeBot, PerplexityBot, Google-Extended | robots.txt (Protego) |
| c2 | llms.txt present at site root | publisher root |
| c3 | sitemap.xml discoverable: either /sitemap.xml resolves to a valid <urlset> / <sitemapindex>, or robots.txt declares a Sitemap: directive | sitemap fetch / robots.txt |

Category 3 score = (c1 + c2 + c3) / 3. The four AI-bot allowances are aggregated within c1 rather than spread across four separate checks because on most platforms they correlate near-perfectly: if a publisher updates robots.txt to allow GPTBot, they typically allow the others in the same diff.
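The c1 fraction and the category mean can be sketched as below. Note the substitution: the build uses Protego, while this sketch uses the standard library's `urllib.robotparser` so that it runs without extra dependencies, and the two parsers can disagree on edge cases such as wildcard rules. Testing each bot against a sample article URL, and the function names, are also assumptions.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")


def c1_score(robots_txt, sample_url):
    """Fraction of the four AI bots that robots.txt allows to fetch sample_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = sum(parser.can_fetch(bot, sample_url) for bot in AI_BOTS)
    return allowed / len(AI_BOTS)


def crawler_score(c1, c2, c3):
    """Category 3 score: (c1 + c2 + c3) / 3, with c2 and c3 as 0 or 1."""
    return (c1 + c2 + c3) / 3


ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(c1_score(ROBOTS, "https://example.org/article/1"))
```

In this example only GPTBot is blocked, so three of the four bots are allowed.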

Composite

Composite = (0.60 × on_page) + (0.20 × registry) + (0.20 × crawler)
Score = round(Composite × 100, 1)

Tier bands:

Per-category scores are exposed in the workbook and JSON; recompute the composite under any other weighting you prefer.
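Recomputing the composite under a different weighting is a one-line calculation. The sketch below assumes the three category scores are already on a 0–1 scale, as the formula above implies; the function name and the weight-sum check are illustrative additions.

```python
def composite_score(on_page, registry, crawler, weights=(0.60, 0.20, 0.20)):
    """Weighted composite on a 0-100 scale, rounded to one decimal place."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w_page, w_registry, w_crawler = weights
    composite = w_page * on_page + w_registry * registry + w_crawler * crawler
    return round(composite * 100, 1)


# Published 60/20/20 weighting.
print(composite_score(0.7, 1.0, 0.5))
# An alternative equal weighting, for readers who prefer it.
print(composite_score(0.7, 1.0, 0.5, weights=(1 / 3, 1 / 3, 1 / 3)))
```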

5. The raw_d7_reachable datum

For each journal, raw_d7_reachable is the fraction of sample article URLs that returned HTTP 200 to the research crawler UA in Pass 1. It is not part of the composite. It exists to let the crawler-blocking blog post remain reproducible and to keep the “is this publisher hostile to scholarly bots regardless of robots.txt” finding on the public ranking page.

A journal can have a high composite (clean on-page markup, indexed in MEDLINE, AI bots declared allowed) and a low raw_d7_reachable simultaneously; that is the central finding of the v2 dataset. The two metrics answer different questions and they get different surfaces.
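The definition reduces to a simple fraction. In this sketch the input is a list of Pass 1 status codes, one per sampled article, with `None` standing in for a timeout or a connection-level block; that representation and the function name are assumptions.

```python
def raw_d7_reachable(status_codes):
    """Fraction of sample URLs that returned HTTP 200 to the research UA."""
    if not status_codes:
        return None  # no samples were probed
    return sum(code == 200 for code in status_codes) / len(status_codes)


# One of three samples reachable: a low value that is reported
# separately and never enters the composite.
print(raw_d7_reachable([200, 403, None]))
```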

6. Known limitations: stated honestly

  1. The Googlebot UA spoof is not Googlebot. Publishers that reverse-DNS-verify the source IP per Google’s verifying-googlebot protocol will still 403 us. In practice almost no biomedical publisher does this; in the cases where they do, the URL is captured manually from a real Chrome session instead.
  2. k = 3 sampling is small. Three articles per journal; per-check median and majority vote are computed across them. A journal with two templates live at once may show its newer template only.
  3. OpenAlex’s recency-first sample biases toward the current platform. Publishers in the middle of a migration may not be scored on legacy article URLs.
  4. JS-rendered article body is penalised. If the initial HTML doesn’t contain the abstract text and more than 2,000 chars of visible text, p10 fails. This is intentional: Google Scholar’s crawler is conservative about JS execution, and we follow the same convention.
  5. The composite is a weighted summary. If you disagree with the 60/20/20 split, recompute from the per-category scores in journal_seo_raw.json. The data is the deliverable; the composite is one summary of it.
  6. Score drift. Publisher HTML changes. A score stamped more than 6 months old should be treated as stale.

7. Validation: by direct inspection, not by downstream prediction

The framework’s validation is the inspection itself. Every check has a defined pass criterion that any reader can verify by visiting the sampled DOI, viewing the page source, and checking whether the markup the check looks for is present or absent. There is no SERP-correlation experiment, no machine-learned weighting, and no hidden judgment between the HTML and the score.

Footnote on the validation history. An earlier draft (v1) specified a Phase 2 validation that would test the framework against general Google web search rank. We ran it on 64 sampled articles across 22 reachable journals; the result was Spearman ρ = 0.05, far below the 0.4 threshold the methodology had set. The framework did not “fail” this validation; the validation tested the wrong question. For papers that are reachable, exact-title general-Google retrieval is essentially a solved problem (53% of articles surfaced their canonical URL at SERP rank 1, 98% within the top 3), so the dependent variable had near-zero variance to predict, and PubMed/PMC dominated top results via raw domain authority. The methodology was revised on 14 April 2026 to state explicitly that validation is by inspection.

8. Reproducibility

The build is reproducible end-to-end. The script, the cached article HTML, and every upstream API response that fed into a score are all published in this directory. Re-running python3 build_journal_seo_db.py regenerates journal_seo_raw.json byte-for-byte, conditional only on the upstream sources (OpenAlex, CrossRef, DOAJ, NLM, publisher robots.txt) being stable at the moment you run it. That reproducibility is the point: any score you distrust can be re-derived from first principles without taking this project’s word for anything.

The raw JSON records, for each journal: the exact run timestamp; the three sample article DOIs and their resolved URLs; the fetch source tag (cache | urllib_gbot | curl_gbot | failed) per sample; the research-UA reachability flag per sample (Pass 1 result); every per-check raw value; the aggregated category scores; the composite and tier. A score you can’t trace to a specific sample and a specific raw value is a score we would not have published.

9. Document history

Note. This methodology is published under CC-BY 4.0 and is open to critique. Send methodology disputes to gkumar@academicseo.co.uk. Score disputes require a reproduction of the check that contradicts the DB entry; the methodology commits to re-running the check and correcting the entry if the upstream source has changed.

Related context for PIs reading this: the reason journal-level discoverability matters to individual papers is the same reason the findability-is-now-a-funding-requirement argument holds: funders, not just citation committees, are starting to treat whether your work can actually be found as part of its track record. And if your paper sits on a platform that scores poorly here, a well-optimised preprint is often the most reliable route to a reachable canonical URL.