# Journal SEO Ranking — Methodology

**Version:** 1.0 (2026-04-14)
**Author:** Girishkumar Kumaran, PhD — Academic SEO (academicseo.co.uk)
**Scope:** Phase 1 — top 50 biomedical journals ranked by OpenAlex citation influence
**Validation status:** Composite weights are priors. SERP-rank validation described in §7 is run after the initial build; results reported in `VALIDATION.md`.

---

## 1. What this ranking measures — and what it does not

This ranking scores journals on **discoverability infrastructure**: how well a journal's platform helps an article get found on Google Scholar, PubMed, general Google, and AI search engines (Google AI Overviews, ChatGPT, Perplexity, Bing Copilot) *after* it is published.

It does **not** measure:

- Citation impact (use Journal Impact Factor, h5-index, SCImago Journal Rank, or Scopus CiteScore)
- Editorial quality or peer review rigor
- Journal prestige or reputation
- Rejection rate or publication speed
- Author satisfaction or submission experience

Citation-based rankings answer: "once a paper is read, how much is it cited?" This ranking answers: "can a paper be found in the first place?" The two are complementary, not substitutes. A journal can have a high Impact Factor and still score poorly here if its platform has weak technical SEO, and vice versa.

## 2. Why this ranking exists

No established journal SEO ranking system currently exists. Citation-based journal metrics (JCR, Scopus, SJR, Google Scholar Metrics) measure citation impact and prestige; the DOAJ Seal was retired in April 2025; Plan S compliance indicators are binary and cover only OA journals. The closest existing work is the academic SEO best-practice literature (Beel & Gipp, the Scholastica blog, De Gruyter author resources), which lists factors without combining them into a comparable per-journal score.

The gap is particularly consequential because:

1. Discoverability infrastructure strongly affects authors but is invisible at submission time. PIs choose journals on impact and fit; they inherit the platform's SEO quality with no warning.
2. Google's March 2026 core update introduced site-wide quality aggregation, meaning a journal platform's weaker article pages drag down its stronger ones in the same way a paper's weak abstract can drag down its full text.
3. AI search engines now cite scholarly content heavily in Overviews and conversational answers, and the crawler posture of publisher platforms toward GPTBot / ClaudeBot / PerplexityBot directly determines whether a given paper can be cited in those answers.

## 3. Data sources

Every metric is derived from one of the following public, machine-readable sources. No scraping of gated content, no circumvention of robots.txt, no private APIs.

| Source | Used for | API / URL |
|---|---|---|
| OpenAlex | Journal list, citation metrics, DOAJ status, is_oa, ISSN, homepage, sample article IDs | `https://api.openalex.org/sources`, `https://api.openalex.org/works` |
| CrossRef REST API | DOI registration, member metadata completeness, reference deposition | `https://api.crossref.org/journals/{issn}` |
| DOAJ API | OA status, license, APC, machine-readability flags | `https://doaj.org/api/search/journals/{issn}` |
| NLM Catalog (E-utilities) | MEDLINE indexing, PubMed Central auto-deposition | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/` |
| Publisher robots.txt | AI crawler posture (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) | `https://{domain}/robots.txt` |
| Sample article HTML | Highwire Press `citation_*` meta tags, Schema.org Article, canonical, OG tags, JATS XML link, HTML full text availability, ORCID display, abstract in HTML, semantic structure | Fetched with honest `User-Agent: AcademicSEO-Research/1.0 (mailto:gkumar@academicseo.co.uk)` |

Sample article selection: for each journal, the most recent original research article in OpenAlex with `type:article`, sorted by `publication_date:desc`, first hit. The sampled DOI and URL are recorded in the DB so scores are reproducible and auditable.
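As a concrete sketch of that selection step, the query can be built with the standard library alone. The function and parameter names here are illustrative, and the `filter`/`sort` field names follow the current OpenAlex API conventions, which may change:

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

def sample_article_query(source_id: str, mailto: str) -> str:
    """Query URL for a journal's most recent original research article.

    source_id is the journal's OpenAlex source ID (an "S" followed by
    digits); mailto joins OpenAlex's polite pool. Filter and sort field
    names follow the current OpenAlex docs and may change.
    """
    params = {
        "filter": f"primary_location.source.id:{source_id},type:article",
        "sort": "publication_date:desc",
        "per-page": 1,  # first hit only, per the selection rule above
        "mailto": mailto,
    }
    return f"{OPENALEX_WORKS}?{urlencode(params)}"
```

The first hit's DOI and landing-page URL are what get recorded in the DB.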

## 4. The 5 metric categories and 38 checks

Every check is either binary (pass/fail, 1 or 0) or continuous (0–1). Weights are reported below in §5.

### Category A — Indexing & registry coverage (20%)

| # | Check | Source | Type |
|---|---|---|---|
| A1 | MEDLINE indexed | NLM Catalog | binary |
| A2 | PubMed Central auto-deposition | NLM Catalog | binary |
| A3 | DOAJ listed | DOAJ / OpenAlex | binary |
| A4 | CrossRef member with active DOI registration | CrossRef | binary |
| A5 | OpenAlex coverage (fraction of published works indexed; target ≥90%) | OpenAlex | continuous |
| A6 | Scopus or Web of Science indexed (OpenAlex `is_core:true`) | OpenAlex | binary |
| A7 | Valid ISSN (both print and electronic where applicable) | OpenAlex | binary |

*Rationale:* Registry presence is a hard ceiling on visibility. A paper in a journal that is not MEDLINE-indexed cannot appear in PubMed regardless of how good its metadata is. Weighted 20% — not higher, because most respectable biomedical journals clear the registry bar, and the differentiation happens in the next categories.
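A1 and A2 start from the journal's NLM Catalog record. A minimal sketch of the lookup URL, assuming NCBI's documented `esearch` syntax and the `[ISSN]` field tag (the record returned then carries the indexing-status fields):

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def nlm_catalog_query(issn: str) -> str:
    """esearch URL to locate a journal's NLM Catalog record by ISSN.

    The [ISSN] field tag follows NCBI's documented search syntax; the
    matched record's indexing-status fields then answer A1 and A2.
    """
    params = {"db": "nlmcatalog", "term": f"{issn}[ISSN]", "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"
```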

### Category B — Scholar & search technical SEO (25%)

This is the heaviest-weighted category because it dominates the two highest-volume search surfaces (Google Scholar, general Google).

| # | Check | Source | Type |
|---|---|---|---|
| B1 | **Highwire Press `citation_*` meta tags complete** (title, author, publication_date, journal_title, volume, issue, firstpage, lastpage, pdf_url, doi) | Sample article HTML | continuous (N_present / 10) |
| B2 | Schema.org `Article` or `ScholarlyArticle` JSON-LD present and valid | Sample article HTML | binary |
| B3 | Canonical tag present and matches article URL | Sample article HTML | binary |
| B4 | HTML full text available (not PDF-only) | Sample article HTML | binary |
| B5 | PDF has selectable text layer (heuristic: `application/pdf` response + file size check against scan-like thresholds) | Sample article PDF header | binary |
| B6 | Open Graph tags complete (og:title, og:description, og:image, og:type) | Sample article HTML | continuous |
| B7 | HTTPS on article URL | Sample article URL | binary |
| B8 | Clean URL structure (no raw query strings, DOI or slug in path) | Sample article URL | binary |
| B9 | Semantic HTML (single h1, sectioning via `<section>` / `<article>`, figures with `<figcaption>`) | Sample article HTML | continuous |

*Rationale:* B1 is the single most important check in the entire framework. Google Scholar's own inclusion guidelines explicitly prefer Highwire Press `citation_*` meta tags and discourage Dublin Core for journal content. A journal missing these tags cannot be indexed in Scholar properly, regardless of how good everything else on the page is. It gets ~40% of the weight within Category B (≈10% of the composite).
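B1 can be computed from the landing-page HTML with nothing beyond the standard library. This is an illustrative sketch, not the production scorer:

```python
from html.parser import HTMLParser

# The ten tags B1 requires, per the table above.
REQUIRED_TAGS = [
    "citation_title", "citation_author", "citation_publication_date",
    "citation_journal_title", "citation_volume", "citation_issue",
    "citation_firstpage", "citation_lastpage", "citation_pdf_url",
    "citation_doi",
]

class _MetaNames(HTMLParser):
    """Collects the name= attribute of every <meta> tag."""
    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.add(name.lower())

def b1_score(article_html: str) -> float:
    """Continuous B1: fraction of the ten required Highwire tags present."""
    parser = _MetaNames()
    parser.feed(article_html)
    return sum(t in parser.names for t in REQUIRED_TAGS) / len(REQUIRED_TAGS)
```

A page carrying only `citation_title` and `citation_doi`, for example, scores 0.2 on B1.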

### Category C — Metadata richness & Plan S compliance (20%)

| # | Check | Source | Type |
|---|---|---|---|
| C1 | JATS XML exposure (linked from article landing page or machine-readable) | Sample article HTML + CrossRef | binary |
| C2 | Abstract present in HTML (not behind JS or login) | Sample article HTML | binary |
| C3 | MeSH terms or keywords in HTML microdata or meta tags | Sample article HTML | binary |
| C4 | Author ORCID iDs displayed and marked up | Sample article HTML | binary |
| C5 | Author affiliations with ROR IDs | Sample article HTML | binary |
| C6 | Funding statements with machine-readable grant IDs | Sample article HTML + CrossRef | binary |
| C7 | Reference list machine-readable with DOIs/PMIDs; CrossRef cited-references deposited | CrossRef | binary |
| C8 | Data availability statement present | Sample article HTML | binary |
| C9 | Figure alt text populated | Sample article HTML | continuous |

*Rationale:* Plan S compliance criteria anchor this category. These checks map to specific Plan S technical requirements (machine-readable metadata, JATS XML, CC0 metadata licensing) and to the fields that Google AI Overviews and Perplexity use when extracting structured answers from scholarly sources.
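To illustrate how cheap some of these checks are to run, C4 can be approximated with the standard ORCID iD pattern. This is a sketch with an acknowledged gap, noted in the docstring:

```python
import re

# The standard ORCID iD shape: four groups of four, last character may be X.
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[0-9X]\b")

def c4_orcid_displayed(article_html: str) -> bool:
    """Binary C4 approximation: at least one ORCID iD visible in the HTML.

    A fuller check would also require orcid.org link markup, so treat a
    pass here as necessary-looking evidence, not proof of proper markup.
    """
    return bool(ORCID_RE.search(article_html))
```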

### Category D — AI search posture (15%)

| # | Check | Source | Type |
|---|---|---|---|
| D1 | `robots.txt` allows GPTBot | Journal `robots.txt` | binary |
| D2 | `robots.txt` allows ClaudeBot | Journal `robots.txt` | binary |
| D3 | `robots.txt` allows PerplexityBot | Journal `robots.txt` | binary |
| D4 | `robots.txt` allows Google-Extended (distinct from Googlebot) | Journal `robots.txt` | binary |
| D5 | `llms.txt` present at root | Journal root | binary (bonus) |
| D6 | Article HTML served without JS rendering (initial HTML contains abstract text) | Sample article HTML | binary |

*Rationale:* A publisher blocking AI crawlers in robots.txt cannot have its articles cited in AI Overviews. This category differentiates journals sharply: many major publishers currently block all AI user agents, while a handful explicitly allow them. That split is real and currently predictive of AI citation presence.
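Checks D1–D4 map directly onto Python's `urllib.robotparser`. A minimal sketch, with the caveat that the stdlib parser ignores some wildcard-path extensions publishers occasionally use, so the result is a heuristic:

```python
from urllib.robotparser import RobotFileParser

# The four agents checked in D1-D4.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def ai_crawler_posture(robots_txt: str, article_url: str) -> dict:
    """Return {agent: allowed} for a fetched robots.txt body.

    parse() takes the file's lines, so no network access happens here;
    fetching https://{domain}/robots.txt is the caller's job.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, article_url) for agent in AI_AGENTS}
```

A publisher that disallows GPTBot site-wide fails D1 even if every other check passes, which is exactly the asymmetry the rationale above describes.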

### Category E — Access, openness & licensing (20%)

| # | Check | Source | Type |
|---|---|---|---|
| E1 | Open Access status: gold (3), hybrid (2), green (1), none (0) | OpenAlex / DOAJ | continuous |
| E2 | CC0 or CC-BY metadata licensing (Plan S requirement) | DOAJ / CrossRef | binary |
| E3 | Machine-readable license tag in article HTML | Sample article HTML | binary |
| E4 | Machine-readable OA status tag in article HTML | Sample article HTML | binary |
| E5 | PubMed Central auto-deposition for articles | NLM Catalog | binary |
| E6 | Preprint-friendly policy (accepted without embargo) | Journal policy page | binary |
| E7 | APC disclosed clearly on a public page | DOAJ / homepage | binary |

*Rationale:* Google Scholar explicitly weights free full-text availability in its ranking algorithm. Plan S technical criteria require machine-readable license and OA status metadata. Both factors directly influence whether a paper is surfaced to a non-subscribed reader or an AI crawler without a paywall redirect.
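E1's ordinal scale normalises to 0–1 like every other check. A sketch of the mapping, noting one assumption: statuses outside the table's four-point scale (e.g. OpenAlex's "bronze") default to 0 here, which is this sketch's choice rather than a ruling of the methodology:

```python
# E1's point scale from the table, divided by 3 to land in 0-1.
OA_POINTS = {"gold": 3, "hybrid": 2, "green": 1, "closed": 0}

def e1_score(oa_status: str) -> float:
    """Map an oa_status string to the 0-1 range used by every check."""
    return OA_POINTS.get(oa_status.lower(), 0) / 3
```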

## 5. Weights and composite score

```
Composite = 0.20 × A + 0.25 × B + 0.20 × C + 0.15 × D + 0.20 × E
```

Category scores are normalised to 0–1 before the composite is computed, and the composite is scaled to 0–100 for reporting against the tier bands below. Within each category, individual checks are weighted uniformly unless otherwise noted (B1 is the exception — see §4).

The composite is mapped to four tier bands:

- **Excellent** — 85–100
- **Good** — 70–84
- **Adequate** — 50–69
- **Poor** — below 50

**Weights are priors, not evidence.** They reflect a considered view of which categories matter most for discoverability, but they are not empirically derived. Category D is weighted 15% (not higher) because AI crawler access is a rapidly changing factor and some publishers are likely to shift posture within months. Category B is weighted 25% because Google Scholar remains the dominant scholarly search engine and its ranking is most sensitive to this category.

**The validation step in §7 is the honest check on whether the weights are approximately right.** If the composite score does not correlate with real Google Scholar ranking of sampled papers, the weights are wrong and the methodology document is updated to say so.
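The composite and tier mapping above, as a direct sketch (the `WEIGHTS` dict mirrors the §5 formula; function names are illustrative):

```python
# Section 5 weights, verbatim from the composite formula.
WEIGHTS = {"A": 0.20, "B": 0.25, "C": 0.20, "D": 0.15, "E": 0.20}

def composite(category_scores: dict) -> float:
    """Weighted sum of per-category scores (each 0-1), scaled to 0-100."""
    return 100 * sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)

def tier(score: float) -> str:
    """Map a 0-100 composite to its tier band."""
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Adequate"
    return "Poor"
```

Anyone who disagrees with the weights can swap in their own `WEIGHTS` dict against the per-category scores in the DB, per §6.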

## 6. Known limitations — stated honestly

1. **Sample article bias.** One article per journal. A journal's technical SEO may vary between article pages (e.g. older articles on a legacy template). The sampled DOI is recorded in the DB for auditability. Future versions may sample 3 articles and use the median.

2. **Score drift.** Checks must be re-run every 6 months. Publishers change platforms, robots.txt files change, and Plan S compliance deadlines pass. A score stamped more than 6 months ago is stale.

3. **Adversarial gaming.** Some individual metrics are cheap to fix without fixing underlying discoverability (e.g. adding `citation_*` meta tags without fixing the HTML full text availability). The composite is harder to game because it spans registries, markup, crawler posture, and access — but it is not impossible. The methodology doc lists which metrics are gameable.

4. **Sample article selection bias within journals.** OpenAlex's most-recent-article-first selection preferentially samples the journal's current platform. Journals in the process of a platform migration may have two templates live at once; the score reflects whichever was sampled.

5. **JavaScript-rendered pages.** Some publisher platforms serve the initial HTML without the article body and require JS execution to populate the page. The checks penalise this heavily (B4 in Category B, D6 in Category D) but do not attempt to execute JS. This is intentional — if Google Scholar's crawler does not see the content, this ranking should not count it.

6. **Disagreements about weights.** Users who disagree with the category weights can recompute their own composite from the per-category scores in the DB. The per-category scores are the primary deliverable; the composite is a summary.

7. **Checks that depend on publisher-specific patterns.** Some checks (e.g. JATS XML exposure on the article landing page) are expected to be present differently across publishers. The methodology errs toward false negatives — if we cannot confirm the check passes from public data, we mark it as fail. This may disadvantage journals whose compliance is real but obscured.

## 7. Validation plan

Immediately after the initial DB is built, the following validation is run and results are reported in `VALIDATION.md`:

1. For each ranked journal, select 5 articles published in the past 12 months, stratified by within-journal citation count over that period (sampled across the top quartile, the median, and the bottom quartile).
2. For each article, take its exact title and submit as a Google Scholar query.
3. Record the SERP rank at which the article appears (1 if first result, higher if further down, NA if not in top 20).
4. Compute the mean SERP rank per journal (lower is better).
5. Compute the Spearman rank correlation between the composite SEO score and the negated mean SERP rank (so that higher is better on both axes).

If the correlation coefficient is below 0.4, the weights are wrong and the methodology is revised before publication. If the correlation is between 0.4 and 0.6, the methodology is published with a note about partial validation. If the correlation is above 0.6, the methodology is published as-is.

The validation set is stored in `validation_articles.csv` for reproducibility.
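Steps 4–5 need nothing beyond a rank transform and Pearson's formula on the ranks. A self-contained sketch (in practice `scipy.stats.spearmanr` computes the same quantity; negating the mean SERP ranks before calling it handles the inversion in step 5):

```python
def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the tie group
        mean_rank = (i + j) / 2 + 1     # average position, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

The coefficient this returns is the one tested against the 0.4 and 0.6 thresholds above.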

## 8. Reproducibility

Every score in the DB can be reproduced by re-running `build_journal_seo_db.py` against the same journal list and the same sample article DOIs. The script is deterministic conditional on the upstream data sources. The DB records:

- The exact timestamp of each check
- The sample article DOI used for per-article checks
- The raw response or flag from each data source for each check
- A git commit hash of the scoring code

Disputes about scores are resolved by re-running the checks, not by adjudication.

## 9. What citation-based rankings already do

For completeness and to position this ranking clearly against existing work:

- **Journal Citation Reports (Clarivate) — JCR Impact Factor.** Mean citations per article over 2 years. Citation impact only. Paywalled.
- **Google Scholar Metrics — h5-index, h5-median.** h-index of articles published in the last 5 years. Citation impact only. Free.
- **SCImago Journal Rank (SJR).** Citation-weighted prestige metric incorporating the prestige of citing journals. Citation impact only. Free.
- **Scopus CiteScore.** Mean citations per document over 4 years. Citation impact only. Free.
- **Nature Index.** Share of authorship in a fixed list of "high-quality" journals. A prestige proxy for a narrow subset; counts publications, not citations.

None of these measure discoverability. This ranking is complementary.

## 10. Document history

- **2026-04-14 v1.0** — Initial methodology document. Phase 1 scope (top 50 biomedical journals).

---

*This methodology is published under CC-BY 4.0 and is open to critique. Send methodology disputes to gkumar@academicseo.co.uk. Score disputes require submitting a reproduction of the check that contradicts the DB entry — the methodology commits to re-running the check and correcting if the upstream source has changed.*
