Reproducible artifacts, version 1.0
Every score in the database can be reproduced by re-running the build script against the same journal list and sample article DOIs. The raw JSON contains the full audit trail per journal — every API response, every robots.txt parse, every sample article URL fetched.
1. What this ranking measures — and what it does not
This ranking scores journals on discoverability infrastructure: how well a journal's platform helps an article get found on Google Scholar, PubMed, general Google, and AI search engines (Google AI Overviews, ChatGPT, Perplexity, Bing Copilot) after it is published.
It does not measure:
- Citation impact (use Journal Impact Factor, h5-index, SCImago Journal Rank, or Scopus CiteScore for that)
- Editorial quality or peer review rigor
- Journal prestige or reputation
- Rejection rate or publication speed
- Author satisfaction or submission experience
Citation-based rankings answer: "once a paper is read, how much is it cited?" This ranking answers: "can a paper be found in the first place?" The two are complementary, not substitutes. A journal can have a high Impact Factor and still score poorly here if its platform has weak technical SEO, and vice versa.
2. Why this ranking exists
No established journal SEO ranking system currently exists. Citation-based journal metrics (JCR, Scopus, SJR, Google Scholar Metrics) measure citation impact and prestige. The DOAJ Seal was retired in April 2025. Plan S compliance indicators are binary and cover only OA journals. The closest existing work is the academic SEO best-practice literature (Beel & Gipp, Scholastica, De Gruyter author resources), which lists factors without combining them into a comparable per-journal score.
The gap is consequential because:
- Discoverability infrastructure is heavily author-facing but invisible at submission time. PIs choose journals based on impact and fit; they inherit the platform's SEO quality with no warning.
- Google's March 2026 core update introduced site-wide quality aggregation, meaning a journal platform's weaker article pages drag down its stronger ones in the same way a paper's weak abstract can drag down its full text.
- AI search engines now cite scholarly content heavily in Overviews and conversation answers, and the crawler posture of publisher platforms toward GPTBot, ClaudeBot, PerplexityBot, and Google-Extended directly determines whether a given paper can be cited in those answers.
3. Data sources
Every metric is derived from one of the following public, machine-readable sources. No scraping of gated content, no circumvention of robots.txt, no private APIs.
| Source | Used for |
|---|---|
| OpenAlex | Journal list, citation metrics, DOAJ status, OA status, ISSN, homepage, sample article IDs |
| CrossRef REST API | DOI registration, member metadata completeness, reference deposition |
| DOAJ API | OA status, license, APC, machine-readability flags |
| NLM Catalog (E-utilities) | MEDLINE indexing (currentindexingstatus), PubMed Central deposition (PMC esearch with article-count ratio against OpenAlex works_count) |
| Publisher robots.txt | AI crawler posture (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, ChatGPT-User, CCBot, anthropic-ai). Parsed with both a hand-rolled implementation and the Protego library used by Scrapy; 100% agreement on 160 cross-validation checks. |
| Sample article HTML | Highwire Press citation_* meta tags, Schema.org Article, canonical, OG tags, JATS XML link, HTML full text availability, ORCID display, abstract in HTML, semantic structure. Fetched with User-Agent: AcademicSEO-Research/1.0 (+mailto:[email protected]). |
Sample article selection. For each journal, up to eight recent original research articles in OpenAlex with type:article,has_doi:true,has_abstract:true, sorted by publication_date:desc. The build attempts to fetch each candidate's landing page in order, rejecting candidates whose resolved URL is on a domain unrelated to the journal homepage (a guard against OpenAlex attribution errors), or whose response is a small meta-refresh shell with no citation tags. The first three valid samples are kept and per-check scores are aggregated by median (continuous fields) or majority vote (binary fields). Of the 24 scored journals, 21 have k = 3 valid samples; 2 have k = 1; 1 has k = 0.
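The sample-validation guard described above can be sketched as a small predicate. This is a sketch under stated assumptions: the byte threshold and the two-label domain comparison are illustrative, not the build script's exact rules.

```python
from urllib.parse import urlparse

def same_site(url_a: str, url_b: str) -> bool:
    """Crude domain guard: compare the last two host labels.

    A real implementation would consult a public-suffix list; the
    two-label comparison here is a simplifying assumption.
    """
    def host(u: str) -> str:
        return ".".join(urlparse(u).netloc.lower().split(".")[-2:])
    return host(url_a) == host(url_b)

def is_valid_sample(resolved_url: str, html: str, journal_homepage: str,
                    min_bytes: int = 4096) -> bool:
    """Reject off-domain landings and tiny meta-refresh shells."""
    if not same_site(resolved_url, journal_homepage):
        return False  # guard against OpenAlex attribution errors
    if len(html.encode()) < min_bytes and "citation_title" not in html:
        return False  # small shell with no Highwire tags
    return True
```

The build keeps iterating through the up-to-eight candidates until three samples pass this predicate or the candidate list is exhausted.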
Elsevier ScienceDirect rewriting. Where doi.org resolves to linkinghub.elsevier.com — a 2-3 kB meta-refresh shell — the build rewrites the URL to https://www.sciencedirect.com/science/article/pii/{pii}, which serves the article HTML with Highwire Press meta tags. This recovers Cell Press, Lancet, JBC, Gastroenterology, NeuroImage, Immunity, Molecular Cell, and JACC from the blocked tier they would otherwise sit in.
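A minimal sketch of the rewrite, assuming the common linkinghub `/retrieve/pii/` path; the real build may extract the PII differently.

```python
import re

# linkinghub shell URLs carry the Elsevier PII as the final path segment
LINKINGHUB = re.compile(
    r"^https?://linkinghub\.elsevier\.com/retrieve/pii/(?P<pii>[A-Z0-9()\-]+)",
    re.IGNORECASE,
)

def rewrite_elsevier(url: str) -> str:
    """Rewrite a linkinghub meta-refresh shell URL to the ScienceDirect
    article page, which serves full HTML with Highwire tags.
    Non-Elsevier URLs pass through unchanged."""
    m = LINKINGHUB.match(url)
    if m:
        return f"https://www.sciencedirect.com/science/article/pii/{m.group('pii')}"
    return url
```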
4. The 5 metric categories and 38 checks
Every check is either binary (pass/fail, 1 or 0) or continuous (0–1). Weights are reported in §5.
Category A — Indexing & registry coverage (20%)
| # | Check | Source | Type |
|---|---|---|---|
| A1 | MEDLINE indexed (currentindexingstatus = Y) | NLM Catalog | binary |
| A2 | PubMed Central auto-deposition (PMC count ÷ OpenAlex works_count ≥ 0.5) | PMC esearch | binary |
| A3 | DOAJ listed | DOAJ / OpenAlex | binary |
| A4 | CrossRef member with active DOI registration | CrossRef | binary |
| A5 | OpenAlex coverage with ≥90% of published works indexed | OpenAlex | continuous |
| A6 | Scopus or Web of Science indexed (is_core:true) | OpenAlex | binary |
| A7 | Valid ISSN | OpenAlex | binary |
Rationale. Registry presence is a hard ceiling on visibility. A paper in a journal that is not MEDLINE-indexed cannot appear in PubMed regardless of how good its metadata is. Weighted 20% — not higher, because most respectable biomedical journals clear the registry bar, and the differentiation happens in subsequent categories.
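The A2 ratio test reduces to a small pure function. The counts themselves would come from a PMC esearch on the journal's ISSN and from the OpenAlex source record; they are passed in here so the check stays pure and testable.

```python
def pmc_deposition_pass(pmc_count: int, openalex_works: int,
                        threshold: float = 0.5) -> int:
    """Check A2 (binary): pass if the journal's PMC article count
    covers at least `threshold` of its OpenAlex works_count."""
    if openalex_works <= 0:
        return 0  # conservative rule: no denominator, no pass
    return int(pmc_count / openalex_works >= threshold)
```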
Category B — Scholar & search technical SEO (25%)
The heaviest-weighted category because it dominates the two highest-volume search surfaces (Google Scholar, general Google).
| # | Check | Type |
|---|---|---|
| B1 | Highwire Press citation_* meta tags complete — title, author, publication_date, journal_title, volume, issue, firstpage, lastpage, pdf_url, doi (count present out of 10) | continuous |
| B2 | Schema.org Article or ScholarlyArticle JSON-LD present and valid | binary |
| B3 | Canonical tag present and matches article URL | binary |
| B4 | HTML full text available (not PDF-only) | binary |
| B5 | PDF has selectable text layer (heuristic) | binary |
| B6 | Open Graph tags complete (og:title, og:description, og:image, og:type) | continuous |
| B7 | HTTPS on article URL | binary |
| B8 | Clean URL structure (no raw query strings, DOI or slug in path) | binary |
| B9 | Semantic HTML (single h1, sectioning, figcaptions) | continuous |
Rationale. B1 is the single most important check in the entire framework. Google Scholar's own inclusion guidelines explicitly prefer Highwire Press citation_* meta tags and discourage Dublin Core for journal content. A journal missing these tags cannot be indexed in Scholar properly, regardless of how good everything else on the page is. B1 carries roughly 40% of the within-category weight (≈10% of the composite).
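A minimal sketch of the B1 tally over the ten named fields, using only the standard library. Real pages sometimes declare these tags with unusual casing or via `property` attributes; this sketch ignores those variants.

```python
from html.parser import HTMLParser

HIGHWIRE_FIELDS = {
    "citation_title", "citation_author", "citation_publication_date",
    "citation_journal_title", "citation_volume", "citation_issue",
    "citation_firstpage", "citation_lastpage", "citation_pdf_url",
    "citation_doi",
}

class HighwireCounter(HTMLParser):
    """Collect which of the 10 Highwire Press fields carry content."""
    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = a.get("name", "").lower()
        if name in HIGHWIRE_FIELDS and a.get("content"):
            self.found.add(name)

def b1_score(html: str) -> float:
    """Continuous B1 score: fraction of the 10 fields present."""
    parser = HighwireCounter()
    parser.feed(html)
    return len(parser.found) / len(HIGHWIRE_FIELDS)
```

Note that a field with an empty `content` attribute does not count, mirroring the conservative pass rules in §6.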
Category C — Metadata richness & Plan S compliance (20%)
| # | Check | Source |
|---|---|---|
| C1 | JATS XML exposure (linked from article landing page or machine-readable) | HTML + CrossRef |
| C2 | Abstract present in HTML (not behind JS or login) | HTML |
| C3 | MeSH terms or keywords in HTML microdata or meta tags | HTML |
| C4 | Author ORCID iDs displayed and marked up | HTML |
| C5 | Author affiliations with ROR IDs | HTML |
| C6 | Funding statements with machine-readable grant IDs | HTML + CrossRef |
| C7 | Reference list machine-readable with DOIs/PMIDs; CrossRef cited-references deposited | CrossRef |
| C8 | Data availability statement present | HTML |
| C9 | Figure alt text populated | HTML |
Rationale. Plan S compliance criteria anchor this category. These checks map to specific Plan S technical requirements (machine-readable metadata, JATS XML, CC0 metadata licensing) and to the fields that Google AI Overviews and Perplexity use when extracting structured answers from scholarly sources.
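The C4 check can be approximated with a pattern match. This sketch only finds ORCID-shaped strings; the full check also requires proper markup (e.g. a visible orcid.org link), which is omitted here.

```python
import re

# ORCID iDs are 16 digits in four hyphenated groups; the final
# character may be an "X" checksum digit.
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")

def orcids_in_html(html: str) -> list[str]:
    """Check C4 (simplified): collect ORCID-shaped identifiers
    from the landing-page HTML."""
    return sorted(set(ORCID_RE.findall(html)))
```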
Category D — AI search posture (15%)
| # | Check | Source |
|---|---|---|
| D1 | robots.txt allows GPTBot | robots.txt (parsed by Protego) |
| D2 | robots.txt allows ClaudeBot | robots.txt (parsed by Protego) |
| D3 | robots.txt allows PerplexityBot | robots.txt (parsed by Protego) |
| D4 | robots.txt allows Google-Extended (distinct from Googlebot) | robots.txt (parsed by Protego) |
| D5 | llms.txt present at root | journal root (bonus) |
| D6 | Article HTML served without JS rendering (initial HTML contains abstract text) | sample HTML |
Rationale. A publisher blocking AI crawlers in robots.txt cannot have its articles cited in AI Overviews. This category differentiates journals harshly because many major publishers currently block all AI user agents; a handful explicitly allow them. The differentiation is real and currently predictive of AI citation presence.
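The hand-rolled side of the robots.txt cross-validation can be sketched roughly as follows. This simplified parser ignores path wildcards, multi-agent groups, and crawl-delay, all of which Protego handles; it exists only to show the core allow/disallow logic.

```python
def agent_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Minimal robots.txt check: find the matching user-agent group
    (exact name, falling back to '*') and test `path` against its
    longest matching Allow/Disallow prefix rule."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            current = [value.lower()]  # simplification: one UA per group
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            for ua in current:
                groups[ua].append((field, value))
    rules = groups.get(agent.lower(), groups.get("*", []))
    best, verdict = -1, True  # no matching rule means allowed
    for field, prefix in rules:
        if prefix and path.startswith(prefix) and len(prefix) > best:
            best, verdict = len(prefix), field == "allow"
    return verdict
```

Running both this and Protego over each publisher's robots.txt and comparing verdicts is the cross-validation step reported in §3.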
Category E — Access, openness & licensing (20%)
| # | Check | Source |
|---|---|---|
| E1 | Open Access status: gold (3), hybrid (2), green (1), none (0) | OpenAlex / DOAJ |
| E2 | CC0 or CC-BY metadata licensing (Plan S requirement) | DOAJ / CrossRef |
| E3 | Machine-readable license tag in article HTML | HTML |
| E4 | Machine-readable OA status tag in article HTML | HTML |
| E5 | PubMed Central auto-deposition for articles | NLM / PMC |
| E6 | Preprint-friendly policy (accepted without embargo) | journal policy |
| E7 | APC disclosed clearly on a public page | DOAJ / homepage |
Rationale. Google Scholar explicitly weights free full-text availability in its ranking algorithm. Plan S technical criteria require machine-readable license and OA status metadata. Both factors directly influence whether a paper is surfaced to a non-subscribed reader or an AI crawler without a paywall redirect.
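One way to turn the E1 ordinal scale into the 0-1 value that category normalisation in §5 expects. The "none" key follows the table above; OpenAlex also reports bronze and diamond statuses, whose mapping the table leaves open, so the fallback to 0 here is an assumption.

```python
OA_POINTS = {"gold": 3, "hybrid": 2, "green": 1, "none": 0}

def e1_score(oa_status: str) -> float:
    """E1 as a continuous 0-1 value: ordinal OA points over the max.
    Unknown statuses conservatively score 0 (an assumption)."""
    return OA_POINTS.get(oa_status, 0) / 3
```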
5. Weights and composite score
For fully scored journals:
Composite = 0.20 × A + 0.25 × B + 0.20 × C + 0.15 × D + 0.20 × E
Category scores are normalised to 0–1 before the composite is computed. Within each category, individual checks are weighted uniformly unless otherwise noted (B1 is the exception — see §4).
The composite, scaled to 0–100, is mapped to four tier bands:
- Excellent — 85 to 100
- Good — 70 to 84
- Adequate — 50 to 69
- Poor — below 50
Plus a fifth label, Blocked, applied to journals where article HTML fetching failed entirely. For these journals the composite is computed only over the categories that do not depend on article HTML — A (20%), D (15%), and E (20%) — re-normalised over the 55% available denominator, and the result is clearly tagged in the output. Their scores are not directly comparable to fully scored journals and they are listed separately at the bottom of the ranking, not interleaved.
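The composite and its blocked-tier re-normalisation can be sketched as follows; the weights come from §5, and the function names are illustrative.

```python
WEIGHTS = {"A": 0.20, "B": 0.25, "C": 0.20, "D": 0.15, "E": 0.20}
HTML_FREE = ("A", "D", "E")  # categories computable without article HTML

def composite(cat_scores: dict[str, float], blocked: bool = False) -> float:
    """Composite on a 0-100 scale from per-category scores in [0, 1].
    For blocked journals only the HTML-independent categories enter,
    re-normalised over their 0.55 combined weight."""
    cats = HTML_FREE if blocked else tuple(WEIGHTS)
    total_w = sum(WEIGHTS[c] for c in cats)
    score = sum(WEIGHTS[c] * cat_scores[c] for c in cats) / total_w
    return round(100 * score, 1)

def tier(score: float, blocked: bool = False) -> str:
    """Map a 0-100 composite to its tier band."""
    if blocked:
        return "Blocked"
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Adequate"
    return "Poor"
```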
6. Known limitations — stated honestly
- Sample article bias. We sample three articles per journal where possible (median for continuous, majority for binary). Some journals have only one valid sample after filtering; one has zero. The sampled DOIs are recorded in the raw data for auditability.
- Score drift. Checks must be re-run every 6 months. Publishers change platforms, robots.txt updates, Plan S deadlines pass. A score stamped more than 6 months old is stale.
- Adversarial gaming. Some individual metrics are cheap to fix without fixing underlying discoverability (e.g. adding citation_* meta tags without fixing HTML full-text availability). The composite is harder to game because it spans registries, markup, crawler posture, and access — but it is not impossible.
- JavaScript-rendered pages. Some publisher platforms serve initial HTML without the article body and require JS execution to populate the page. The relevant checks (B4, D6) penalise this heavily but do not attempt to execute JS. This is intentional — if Google Scholar's crawler does not see the content, this ranking should not count it.
- Disagreements about weights. Users who disagree with the category weights can recompute their own composite from the per-category scores in the database. The per-category scores are the primary deliverable; the composite is a summary.
- Conservative pass rules. The methodology errs toward false negatives — if we cannot confirm a check passes from public data, we mark it as fail. This may disadvantage journals whose compliance is real but obscured.
- Single timepoint. A single measurement at a single moment. The April 2026 capture will be re-run every six months and results versioned.
7. Validation plan
Phase 1 (this document) is the infrastructure measurement. Phase 2 is the validation step:
- For each ranked journal, select five articles published in the past 12 months, stratified by citation count (drawn from the top quartile, the median, and the bottom quartile of the journal's distribution).
- For each article, take its exact title and submit as a Google Scholar query.
- Record the SERP rank at which the article appears (1 if first result, higher if further down, NA if not in top 20).
- Compute the mean SERP rank per journal (lower is better).
- Compute Spearman rank correlation between the composite SEO score and inverted mean SERP rank.
If the correlation coefficient is below 0.4, the weights are wrong and the methodology will be revised before publication. If the correlation is between 0.4 and 0.6, the methodology will be published with a note about partial validation. If the correlation is above 0.6, the methodology will be published as-is. Results will be reported in a separate VALIDATION.md document regardless of which way they come out.
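The validation gate is mechanical enough to sketch. The no-ties Spearman below is a teaching version only; the real analysis should use scipy.stats.spearmanr, which handles tied ranks.

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rho via the rank-difference formula, assuming no ties."""
    def ranks(values: list[float]) -> list[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def validation_verdict(rho: float) -> str:
    """Map the correlation coefficient to the decision rule in §7."""
    if rho < 0.4:
        return "revise weights"
    if rho <= 0.6:
        return "publish with partial-validation note"
    return "publish as-is"
```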
Phase 3 is the AI-citation experiment: query ChatGPT, Claude, Perplexity, and Google AI Overviews about specific findings from papers across the reachable and blocked tiers, record which URLs the systems cite, and measure whether blocked-tier papers are actually cited less often or less accurately. That is the experiment that turns the dual-mechanism finding from a hypothesised consequence into an observed one.
8. Reproducibility
Every score in the database can be reproduced by re-running build_journal_seo_db.py against the same journal list and the same sample article DOIs. The script is deterministic conditional on the upstream data sources. The database records:
- The exact timestamp of each check
- The sample article DOIs used for per-article checks
- The raw response or flag from each data source for each check
- A version pin for the methodology document
Disputes about scores are resolved by re-running the checks, not by adjudication.
9. What citation-based rankings already do
For completeness and to position this ranking clearly against existing work:
- Journal Citation Reports (Clarivate) — JCR Impact Factor. Mean citations per article over 2 years. Citation impact only. Paywalled.
- Google Scholar Metrics — h5-index, h5-median. h-index of articles published in the last 5 years. Citation impact only. Free.
- SCImago Journal Rank (SJR). Citation-weighted prestige metric incorporating the prestige of citing journals. Citation impact only. Free.
- Scopus CiteScore. Mean citations per document over 4 years. Citation impact only. Free.
- Nature Index. Share of authorship in a fixed list of "high-quality" journals. Citation impact proxy for a narrow subset.
None of these measure discoverability. This ranking is complementary, not a substitute.
10. Document history
- 2026-04-14 v1.0 — Initial methodology document. Phase 1 scope (top 50 biomedical journals). Multi-sample (k = 3) aggregation. Protego-validated robots.txt parsing. Elsevier ScienceDirect URL rewriting. Re-normalised composite for blocked journals.
This methodology is published under CC-BY 4.0 and is open to critique. Send methodology disputes to [email protected]. Score disputes require submitting a reproduction of the check that contradicts the database entry — the methodology commits to re-running the check and correcting if the upstream source has changed.
Read the announcement post: We measured 50 top biomedical journals. Only 5 are open to AI crawlers.