v3 artifacts
Every score in the database can be reproduced by re-running the build script. The raw JSON contains the full per-journal audit trail: every API response, every robots.txt parse, every sample article URL fetched, the fetch source tag for each (cache, urllib, curl, or manual Chrome harvest), the Pass-1 research-UA reachability flag, and every per-check raw value.
1. What this ranking measures, and what it does not
This ranking scores journals on discoverability infrastructure: how well a journal’s article HTML is technically set up for search engines (Google, Google Scholar) and AI-citation crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) after a paper is published.
It does not measure:
- Citation impact (use Journal Impact Factor, h5-index, SCImago Journal Rank, or Scopus CiteScore for that)
- Editorial quality or peer review rigor
- Journal prestige or reputation
- Rejection rate or publication speed
- Author satisfaction or submission experience
A journal can have a high Impact Factor and still score poorly here if its platform has weak technical SEO, and vice versa. The two are complementary, not substitutes. For an argument on why venue-level Impact Factor is a weak predictor of an individual paper’s real-world impact, see beyond impact factor.
2. Why v3 looks different from v2
v2 had 32 checks across 5 categories (A–E) and scored 8 scholarly repositories alongside the journals. v3 changed both:
- Repositories were removed. Comparing PubMed Central to Cell on the same scorecard inflated the apparent quality of repositories on dimensions where they aren’t actually competing for the same retrieval task.
- Categories A and E were removed from the composite. A (indexing prerequisites) almost always passed for the top-50 cohort and had near-zero discriminating power. E (access & openness) entangled licensing, an editorial choice, with discoverability. The parts of A and E that materially affect retrieval (CrossRef coverage, MEDLINE indexing, DOAJ membership for OA journals, OpenAlex works counts) were folded into a single Registry category.
- The result is 17 checks across 3 categories, with weights anchored to where retrieval actually breaks: on-page article markup is doing the heavy lifting, and registry/crawler signals are the secondary gates.
The v2 server-reachability check (D7, “does the server respond at all to a research crawler”) is preserved as a separate raw_d7_reachable datum in the JSON for the crawler-blocking blog post and is exposed on the public ranking, but it is not part of the composite. v3 removes the conflation between “this journal blocks polite scholarly crawlers” (a reachability finding) and “this journal’s HTML is well-marked-up for retrieval” (a compliance finding). They are different claims and they get different surfaces.
3. Data sources
Every metric is derived from one of the following public, machine-readable sources. No scraping of gated content, no circumvention of robots.txt, no private APIs.
| Source | Used for |
|---|---|
| OpenAlex | Journal list, h-index, OA status, ISSN, homepage, sample article candidates |
| CrossRef | DOI deposit count per ISSN (registry presence floor) |
| DOAJ | OA journal membership |
| NLM E-utilities | MEDLINE indexing (NLM Catalog currentindexingstatus) |
| Publisher robots.txt | AI-bot allowances (GPTBot, ClaudeBot, PerplexityBot, Google-Extended), parsed by Protego (Scrapy’s RFC 9309 parser) |
| Publisher llms.txt | Emerging AI-friendly disclosure standard |
| Publisher sitemap.xml | /sitemap.xml directly or via Sitemap: directive in robots.txt |
| Sample article HTML | Every on-page check in §4. Fetched as Googlebot via a three-tier chain (see §3.2). |
3.1 Sample article selection
For each journal we request up to 8 candidate articles from OpenAlex (type:article, has_doi:true, has_abstract:true, sort=publication_date:desc), reject any whose resolved URL is on a domain unrelated to the journal homepage (a guard against OpenAlex attribution errors), reject any that resolves to a small meta-refresh shell with no markup, and keep the first 3 that pass. Per-check scores are aggregated by median (continuous) or majority vote (binary). Sampled DOIs and resolved URLs are recorded in the JSON for reproducibility.
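The aggregation step can be sketched as follows. This is a minimal illustration, not the build script's code: the function name `aggregate_check` is hypothetical, and the handling of ties and empty samples (both treated as a fail) is an assumption the text above does not specify.

```python
from statistics import median


def aggregate_check(values, kind):
    """Collapse one check's per-article values into a journal-level value.

    values: per-article results for a single check (up to 3 samples).
    kind:   "continuous" (median) or "binary" (majority vote).
    """
    if not values:
        # Assumption: a journal with no usable samples scores 0 on the check.
        return 0.0
    if kind == "continuous":
        return float(median(values))
    if kind == "binary":
        # Majority vote: pass only if strictly more than half the samples pass.
        # Assumption: a tie (possible with 2 samples) counts as a fail.
        passes = sum(1 for v in values if v)
        return 1.0 if passes * 2 > len(values) else 0.0
    raise ValueError(f"unknown check kind: {kind}")
```

With three samples, a binary check passes when at least two articles pass, and a continuous check takes the middle value.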
For Elsevier journals where doi.org resolves to linkinghub.elsevier.com, a 2–3 kB meta-refresh shell, the build rewrites the URL to https://www.sciencedirect.com/science/article/pii/{pii}, which serves the article HTML with Highwire Press meta tags. Verified empirically; documented in the build script.
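A rewrite of this kind might look like the sketch below. It assumes the resolved linkinghub URL has the shape `https://linkinghub.elsevier.com/retrieve/pii/{pii}`; the function name `rewrite_elsevier` and the regular expression are illustrative, not taken from the build script.

```python
import re

# Assumed shape of the URL that doi.org resolves to for Elsevier articles.
_LINKINGHUB_PII = re.compile(
    r"^https?://linkinghub\.elsevier\.com/retrieve/pii/([A-Za-z0-9]+)"
)


def rewrite_elsevier(url):
    """Map a linkinghub meta-refresh URL to the ScienceDirect article URL.

    URLs that do not match the assumed linkinghub shape are returned unchanged.
    """
    match = _LINKINGHUB_PII.match(url)
    if match is None:
        return url
    return f"https://www.sciencedirect.com/science/article/pii/{match.group(1)}"
```

Non-Elsevier URLs pass through untouched, so the function can be applied to every resolved URL without a publisher check first.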
3.2 Dual-UA fetch chain: disclosed, not hidden
A core finding of the v2 work was that 15+ top-50 publishers return HTTP 403 to a polite scholarly research crawler at the TLS / IP layer regardless of what their robots.txt declares. This created a structural problem for v2: those journals scored low on every on-page check not because their HTML was bad, but because we never saw their HTML.
v3 splits the fetch into two passes:
- Pass 1, research-UA probe. A single GET with User-Agent: AcademicSEO-Research/1.0 (+mailto:gkumar@academicseo.co.uk). The result (HTTP 200 vs. 403/blocked/timeout) is recorded as research_ua_reachable per article and aggregated as raw_d7_reachable per journal. This pass is not used for scoring. It exists so the legacy crawler-blocking finding remains reproducible and is exposed on the public ranking as the “Research UA %” column.
- Pass 2, Googlebot audit fetch, three-tier chain. The actual scoring fetch.
  - Tier A: HTML cache. Per-URL cache keyed by sha256("gbot|" + url)[:24]. If a previous run captured the article HTML, we re-use it. The cache lives in journal-seo-rankings/cache/.
  - Tier B: urllib with Googlebot UA. User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).
  - Tier C: curl with the same Googlebot UA, different TLS stack. Cloudflare and Akamai inspect the TLS ClientHello fingerprint (JA3) for bot detection; Python’s urllib and a system curl produce different fingerprints, and we observe in practice that some publishers reject one and accept the other.
  - Residual. Any URL that all three tiers fail to retrieve is captured manually from a real Chrome session (real TLS fingerprint, real cookies, real JavaScript execution) and added to the HTML cache under the same hash key. The next build picks it up automatically.
Each per-article score record carries a fetch_source tag (cache | urllib_gbot | curl_gbot | failed) so any score can be traced back to which tier produced its HTML.
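The Pass 2 control flow can be sketched as below. This is an illustration under stated assumptions, not the build script: the cache is modelled as an in-memory dict rather than the on-disk directory, the key is assumed to be the first 24 characters of the hex digest, and the network tiers are passed in as plain callables (hypothetical stand-ins for the urllib and curl fetchers) so the chain logic can be read and run on its own.

```python
import hashlib


def cache_key(url):
    """Tier A key: assumed to be the first 24 hex chars of sha256("gbot|" + url)."""
    return hashlib.sha256(("gbot|" + url).encode("utf-8")).hexdigest()[:24]


def fetch_with_chain(url, cache, tiers):
    """Return (html, fetch_source) for one article URL.

    cache: dict mapping cache_key -> HTML (stand-in for the on-disk cache).
    tiers: ordered list of (fetch_source_tag, fetcher), where fetcher(url)
           returns HTML text, or None / raises on failure.
    """
    key = cache_key(url)
    if key in cache:
        return cache[key], "cache"
    for tag, fetcher in tiers:
        try:
            html = fetcher(url)
        except Exception:
            # Any fetch error just means "try the next tier" in this sketch.
            html = None
        if html:
            cache[key] = html  # a later run will hit Tier A for this URL
            return html, tag
    return None, "failed"
```

A manually captured page only needs to be stored under `cache_key(url)`; the next call then returns it with the `cache` tag, which matches the residual step described above.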
Why Googlebot is a legitimate proxy. The framework is asking “what does Google see on this page?” Spoofing the UA does not change the answer to that question; it asks the publisher to serve us the same HTML they serve Googlebot. Google publishes a Googlebot verification protocol that uses reverse-DNS, which we cannot satisfy from a non-Google IP, so any publisher that performs that check will still 403 us. In practice almost no biomedical publisher does the reverse-DNS step, so the spoof works for the audit purpose. We disclose the UA; we do not pretend. The build script header and this document both name the UA used.
4. The 3 metric categories and 17 checks
Category 1: On-page article SEO (60% weight)
Direct inspection of the article HTML. Ten checks; the category score is the unweighted mean.
| # | Check | Type |
|---|---|---|
| p1 | Highwire Press citation_* meta tag completeness (citation_title, citation_author, citation_publication_date, citation_journal_title, citation_volume, citation_issue, citation_firstpage, citation_lastpage, citation_pdf_url, citation_doi) | continuous (N/10) |
| p2 | Schema.org Article or ScholarlyArticle JSON-LD with required fields (@type, headline, image, datePublished, author, publisher) | binary |
| p3 | OpenGraph completeness (og:title, og:description, og:image, og:type, og:url) | continuous (N/5) |
| p4 | Twitter Card completeness (twitter:card, twitter:title, twitter:description, twitter:image) | continuous (N/4) |
| p5 | Canonical tag present and exact-match to the resolved article URL | binary |
| p6 | <title> present, plausible length | binary |
| p7 | <meta name="description"> present, 70–200 characters | binary |
| p8 | Single <h1> present | binary |
| p9 | Semantic HTML mean (<article> / <section> / heading hierarchy + figure alt-text coverage) | continuous |
| p10 | Abstract present in initial HTML body and total visible text > 2,000 chars (rules out JS-only renders and stubs) | binary |
Category 1 score = mean of p1…p10. The composite weights this 60% because, conditional on a page being reachable at all, on-page markup is what determines whether Google Scholar parses the article correctly, whether structured data renders in SERP, and whether AI crawlers can extract a citation-worthy block.
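As an example of how a continuous on-page check reduces to a number, here is a sketch of p1 using only the standard library. The class and function names are hypothetical, and requiring a non-empty `content` attribute before a tag counts is an assumption, not something the check definition above states.

```python
from html.parser import HTMLParser

# The ten Highwire Press tags listed for check p1.
HIGHWIRE_TAGS = (
    "citation_title", "citation_author", "citation_publication_date",
    "citation_journal_title", "citation_volume", "citation_issue",
    "citation_firstpage", "citation_lastpage", "citation_pdf_url",
    "citation_doi",
)


class _MetaNameCollector(HTMLParser):
    """Collect the name attribute of every <meta> tag that has content."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        # Assumption: a tag with empty content does not count as present.
        if name and (attr.get("content") or "").strip():
            self.names.add(name)


def highwire_completeness(html):
    """p1: fraction of the ten citation_* tags present in the page (N/10)."""
    collector = _MetaNameCollector()
    collector.feed(html)
    return sum(tag in collector.names for tag in HIGHWIRE_TAGS) / len(HIGHWIRE_TAGS)
```

A page carrying only citation_title, citation_author, and citation_doi would score 0.3 on p1 under this sketch.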
Category 2: Registry presence (20% weight)
Whether the journal exists in the registries that scholarly retrieval and AI systems consult before they ever look at an article URL. Four binary checks.
| # | Check | Source |
|---|---|---|
| r1 | CrossRef has > 100 deposited DOIs for this journal’s ISSN | CrossRef API |
| r2 | NLM Catalog currentindexingstatus = Y (MEDLINE indexed) | NLM E-utilities |
| r3 | DOAJ-listed (auto-pass for closed-access journals) | DOAJ API / OpenAlex is_oa |
| r4 | OpenAlex works_count > 1000 | OpenAlex |
Category 2 score = mean of r1…r4. Closed-access journals are not penalised on r3: DOAJ membership is conditional on being open-access, so it would be a category error to dock a paywalled journal for not appearing there. The auto-pass is documented and machine-checkable: r3 = 1 if not is_oa else doaj_listed.
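The four registry checks and the r3 auto-pass can be written out directly. The function and parameter names below are illustrative; the thresholds and the r3 rule are the ones given in the table and paragraph above.

```python
def registry_score(crossref_dois, medline_indexed, is_oa, doaj_listed, openalex_works):
    """Category 2: unweighted mean of the four binary registry checks."""
    r1 = 1 if crossref_dois > 100 else 0
    r2 = 1 if medline_indexed else 0
    # Auto-pass for closed-access journals; OA journals must be DOAJ-listed.
    r3 = 1 if not is_oa else (1 if doaj_listed else 0)
    r4 = 1 if openalex_works > 1000 else 0
    return (r1 + r2 + r3 + r4) / 4
```

For instance, an open-access journal that passes r1, r2, and r4 but is not in DOAJ scores 0.75, while the same journal behind a paywall scores 1.0.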
Category 3: Crawler-friendly (20% weight)
Whether the journal’s robots posture and discoverability infrastructure invite scholarly + AI retrieval. Three checks.
| # | Check | Source |
|---|---|---|
| c1 | Fraction of major AI bots allowed in robots.txt (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) | robots.txt (Protego) |
| c2 | llms.txt present at site root | publisher root |
| c3 | sitemap.xml discoverable: either /sitemap.xml resolves to a valid <urlset> / <sitemapindex>, or robots.txt declares a Sitemap: directive | sitemap fetch / robots.txt |
Category 3 score = (c1 + c2 + c3) / 3. The four AI-bot allowances are aggregated within c1 rather than spread across four separate checks because on most platforms they correlate near-perfectly: if a publisher updates robots.txt to allow GPTBot, they typically allow the others in the same diff.
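A sketch of c1 and the category score follows. Note the substitution: the build parses robots.txt with Protego, whereas this example uses the standard library's `urllib.robotparser` so that it runs without extra dependencies; the two parsers can disagree on edge cases such as wildcard rules. The function names and the choice of probe URL are assumptions.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")


def ai_bot_fraction(robots_txt, probe_url):
    """c1: fraction of the four AI bots allowed to fetch probe_url.

    Uses urllib.robotparser as a stand-in for Protego (see lead-in).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = sum(parser.can_fetch(bot, probe_url) for bot in AI_BOTS)
    return allowed / len(AI_BOTS)


def crawler_score(c1, has_llms_txt, has_sitemap):
    """Category 3: (c1 + c2 + c3) / 3, with c2 and c3 as binary checks."""
    return (c1 + int(has_llms_txt) + int(has_sitemap)) / 3
```

A robots.txt that disallows only GPTBot and allows everything else gives c1 = 0.75 for any article URL.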
Composite
Composite = (0.60 × on_page) + (0.20 × registry) + (0.20 × crawler)
Score = round(Composite × 100, 1)
Tier bands:
- Excellent: 85+
- Good: 70–84
- Adequate: 50–69
- Poor: below 50
Per-category scores are exposed in the workbook and JSON; recompute the composite under any other weighting you prefer.
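Recomputing the composite under a different weighting takes only a few lines. The sketch below uses the 60/20/20 weights from the formula above as defaults; the function names are illustrative, and treating each tier's lower bound as inclusive (so 84.6 counts as Good) is an assumption, since the bands are stated in whole numbers.

```python
DEFAULT_WEIGHTS = {"on_page": 0.60, "registry": 0.20, "crawler": 0.20}


def composite_score(on_page, registry, crawler, weights=DEFAULT_WEIGHTS):
    """Weighted composite on a 0-100 scale, rounded to one decimal place."""
    composite = (
        weights["on_page"] * on_page
        + weights["registry"] * registry
        + weights["crawler"] * crawler
    )
    return round(composite * 100, 1)


def tier(score):
    """Map a 0-100 score to its band (lower bounds assumed inclusive)."""
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Adequate"
    return "Poor"
```

For category scores of 0.8, 1.0, and 0.5, the default weights give 0.48 + 0.20 + 0.10 = 0.78, a score of 78.0 in the Good band. Passing a different `weights` dict re-weights the same per-category inputs.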
5. The raw_d7_reachable datum
For each journal, raw_d7_reachable is the fraction of sample article URLs that returned HTTP 200 to the research crawler UA in Pass 1. It is not part of the composite. It exists to let the crawler-blocking blog post remain reproducible and to keep the “is this publisher hostile to scholarly bots regardless of robots.txt” finding on the public ranking page.
A journal can have a high composite (clean on-page markup, indexed in MEDLINE, AI bots declared allowed) and a low raw_d7_reachable simultaneously; that is the central finding of the v2 dataset. The two metrics are different questions and they get different surfaces.
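The per-journal datum is a plain fraction over the per-article Pass 1 flags. In this small sketch the function name mirrors the JSON field, and returning 0.0 for a journal with no samples is an assumption.

```python
def raw_d7_reachable(research_ua_reachable):
    """Fraction of sample articles that returned HTTP 200 to the research UA.

    research_ua_reachable: iterable of per-article booleans from Pass 1.
    """
    flags = list(research_ua_reachable)
    if not flags:
        # Assumption: no samples means nothing was reachable.
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)
```

A journal where one of three sampled articles answered the research UA gets 1/3, whatever its composite score is.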
6. Known limitations: stated honestly
- The Googlebot UA spoof is not Googlebot. Publishers that reverse-DNS-verify the source IP per Google’s verifying-googlebot protocol will still 403 us. In practice almost no biomedical publisher does this; in the cases where they do, the URL is captured manually from a real Chrome session instead.
- k = 3 sampling is small. Three articles per journal; per-check median and majority vote are computed across them. A journal with two templates live at once may show its newer template only.
- OpenAlex’s recency-first sample biases toward the current platform. Publishers in the middle of a migration may not be scored on legacy article URLs.
- JS-rendered article body is penalised. If the initial HTML doesn’t contain the abstract text and more than 2,000 chars of visible text, p10 fails. This is intentional: Google Scholar’s crawler is conservative about JS execution and we follow the same convention.
- The composite is a weighted summary. If you disagree with the 60/20/20 split, recompute from the per-category scores in journal_seo_raw.json. The data is the deliverable; the composite is one summary of it.
- Score drift. Publisher HTML changes. A score stamped more than 6 months old should be treated as stale.
7. Validation: by direct inspection, not by downstream prediction
The framework’s validation is the inspection itself. Every check has a defined pass criterion that any reader can verify by visiting the sampled DOI, viewing the page source, and checking whether the markup the check looks for is present or absent. There is no SERP-correlation experiment, no machine-learned weighting, and no hidden judgment between the HTML and the score.
Footnote on the validation history. An earlier draft (v1) specified a Phase 2 validation that would test the framework against general Google web search rank. We ran it on 64 sampled articles across 22 reachable journals; the result was Spearman ρ = 0.05, far below the 0.4 threshold the methodology had set. The framework did not “fail” this validation; the validation tested the wrong question. For papers that are reachable, exact-title general-Google retrieval is essentially a solved problem (53% of articles surfaced their canonical URL at SERP rank 1, 98% within the top 3), so the dependent variable had near-zero variance to predict, and PubMed/PMC dominated top results via raw domain authority. The methodology was revised on 14 April 2026 to state explicitly that validation is by inspection.
8. Reproducibility
The build is reproducible end-to-end. The script, the cached article HTML, and every upstream API response that fed into a score are all published in this directory. Re-running python3 build_journal_seo_db.py regenerates journal_seo_raw.json byte-for-byte, conditional only on the upstream sources (OpenAlex, CrossRef, DOAJ, NLM, publisher robots.txt) being stable at the moment you run it. That reproducibility is the point: any score you distrust can be re-derived from first principles without taking this project’s word for anything.
The raw JSON records, for each journal: the exact run timestamp; the three sample article DOIs and their resolved URLs; the fetch source tag (cache | urllib_gbot | curl_gbot | failed) per sample; the research-UA reachability flag per sample (Pass 1 result); every per-check raw value; the aggregated category scores; the composite and tier. A score you can’t trace to a specific sample and a specific raw value is a score we would not have published.
9. Document history
- 2026-04-14 v1.0, Initial methodology. Phase 1 scope (top 50 biomedical journals). Multi-sample (k = 3) aggregation. Protego-validated robots.txt parsing. Elsevier ScienceDirect URL rewriting. Separate “Blocked” tier with re-normalised composite.
- 2026-04-14 v2.0, Re-anchored entire framework to Google Search Central + Google Scholar Inclusion Guidelines as primary authority. 32 checks across 5 categories (A–E). Added 8 scholarly repositories (PubMed, PMC, Europe PMC, bioRxiv, medRxiv, Semantic Scholar, Zenodo, OpenAlex). Added D7 (server-reachability) as a scored check. Validation reframed from SERP correlation to direct inspection.
- 2026-04-14 v3.0, Dropped repositories. Dropped categories A and E from the composite (kept the parts that materially affect retrieval and folded them into a single Registry category). Reduced to 17 checks across 3 categories with 60/20/20 weighting. Added dual-UA fetch chain (research-UA probe for the legacy reachability datum + Googlebot three-tier audit fetch). The v2 D7 check is preserved as raw_d7_reachable but is not part of the composite. Per-URL HTML cache added for reproducibility and to support manual Chrome capture of residual URLs that the automated fetch chain could not retrieve.
Related context for PIs reading this: the reason journal-level discoverability matters to individual papers is the same reason the findability-is-now-a-funding-requirement argument holds: funders, not just citation committees, are starting to treat whether your work can actually be found as part of its track record. And if your paper sits on a platform that scores poorly here, a well-optimised preprint is often the most reliable route to a reachable canonical URL.