Today's snapshot: 26,417 providers tracked
fonteum
Bulk dataset downloads · Reference

Anonymous, auditable, gz-compressed. One URL per source per snapshot.

Fonteum mirrors every federal-source snapshot to S3 with 90-day rolling retention. The bulk surface gives buyer dev teams, academic researchers, and AI agents one canonical URL per source family and per snapshot date, with full 14-tuple provenance in response headers, a sidecar manifest at manifest.json, and a cross-link to /verify/[snapshot-id] for a SHA-256 hash-match against Fonteum’s integrity attestation.


1. Endpoints

Per-source + top-level. All anonymous.

Three endpoints per source family, plus one top-level discovery index. All anonymous (no Authorization header required), all returning the canonical X-Fonteum-* response headers.

  • GET /api/v1/bulk/<source>/latest.csv.gz — 302 redirect to the most recent S3-cached snapshot. Cache-Control: max-age=300 (rolls daily as new snapshots ingest).
  • GET /api/v1/bulk/<source>/<YYYY-MM-DD>.csv.gz — 302 redirect to a specific dated snapshot. Immutable per snapshot (Cache-Control: max-age=86400, immutable). Pin a date when you need reproducibility, for example to cite a specific snapshot in a paper or to replay an analysis.
  • GET /api/v1/bulk/<source>/manifest.json — sidecar JSON listing every cached snapshot for the source with sha256 + size + verify_url + cached_at + retention_expires per row.
  • GET /api/v1/bulk/manifest.json — top-level index across all registered source families. One URL for DataCite harvesters, dbt source registration, and evaluation scripts.
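
The four URL shapes above can be assembled mechanically. The following is a minimal sketch, assuming the fonteum.com host used in the examples on this page; the helper names bulk_url and manifest_url, and the date used in the demo, are hypothetical and not part of the API.

```python
from datetime import date
from typing import Optional

# Host and path prefix as they appear in the examples on this page.
BASE = "https://fonteum.com/api/v1/bulk"


def bulk_url(source_id: str, snapshot: Optional[date] = None) -> str:
    """Return the latest.csv.gz URL, or the dated URL when a snapshot date is given."""
    name = "latest" if snapshot is None else snapshot.isoformat()
    return f"{BASE}/{source_id}/{name}.csv.gz"


def manifest_url(source_id: Optional[str] = None) -> str:
    """Return the per-source manifest URL, or the top-level index when no source is given."""
    if source_id is None:
        return f"{BASE}/manifest.json"
    return f"{BASE}/{source_id}/manifest.json"


print(bulk_url("cms-pecos"))                   # moving "latest" pointer
print(bulk_url("oig-leie", date(2026, 5, 1)))  # pinned date (illustrative only)
print(manifest_url("cms-pecos"))               # sidecar manifest
print(manifest_url())                          # top-level index
```

Passing a date object rather than a string keeps the YYYY-MM-DD formatting in one place.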

2. Source families

Phase 1: federal sources (15 families registered).

One row per source family registered in the cron-sources registry. Each maps to a source_id value usable in the URL paths above.

  • cms-pecos — CMS PECOS PPEF (Provider Enrollment, Chain & Ownership) · Weekly (Sunday) · license: US-Government-Works
  • oig-leie — OIG LEIE (List of Excluded Individuals/Entities) · Monthly (1st of month) · license: US-Government-Works
  • hrsa-hpsa — HRSA HPSA (Health Professional Shortage Areas) · Quarterly (1st of Jan / Apr / Jul / Oct) · license: US-Government-Works
  • bls-oews — BLS OEWS (Occupational Employment & Wage Statistics) · Annual (mid-May) · license: US-Government-Works
  • bea-regional — BEA Regional Economic Accounts (state GDP) · Annual (mid-October) · license: US-Government-Works
  • cms-nppes — CMS NPPES NPI Registry · Quarterly (per specialty, operator-triggered) · license: US-Government-Works
  • cms-care-compare — CMS Care Compare (per facility type) · Quarterly (per facility type, operator-triggered) · license: US-Government-Works
  • cms-open-payments — CMS Open Payments (Sunshine Act — General Payments) · Annual (mid-June / late June) · license: US-Government-Works
  • cms-hcris-hospital-2552-10 — CMS HCRIS Hospital Cost Reports (form CMS-2552-10) · Annual (operator-triggered ~November) · license: US-Government-Works
  • cms-qpp-mips — CMS QPP MIPS Individual + Group Scores · Annual (operator-triggered ~July, post-performance-year scoring) · license: US-Government-Works
  • cms-provider-utilization — CMS Medicare Provider Utilization & Payment Data (Physician & Other Practitioners by Provider and Service) · Annual (mid-June release of prior data year) · license: US-Government-Works
  • cms-inpatient-utilization — CMS Medicare Inpatient Hospitals by Provider and Service · Annual (mid-June release of prior data year) · license: US-Government-Works
  • cms-outpatient-utilization — CMS Medicare Outpatient Hospitals by Provider and Service · Annual (mid-June release of prior data year) · license: US-Government-Works
  • hrsa-uds — HRSA Uniform Data System (UDS) · Annual (May, post-grant-year reporting) · license: US-Government-Works
  • cms-pos — CMS Provider of Services (POS) — iQIES Facility Registry · Quarterly (operator-triggered; CMS publishes Q1–Q4) · license: US-Government-Works

3. Format

Phase 1 ships gzipped CSV. Parquet + JSON Lines queued.

Each archive is the upstream source CSV exactly as captured at ingestion time, gzip-compressed (application/gzip). Header rows + column ordering match the upstream source — Fonteum does not normalize, dedupe, or transform.
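
Because the archive is an untransformed gzipped CSV, it can be streamed row by row with the standard library. The sketch below assumes UTF-8 text, as the Python example later on this page does; the column names in the stand-in archive are invented for illustration and do not come from any Fonteum source.

```python
import csv
import gzip
import io


def iter_rows(gz_bytes: bytes):
    """Yield each CSV row as a dict keyed by the upstream header row."""
    # newline="" lets the csv module handle embedded line breaks itself.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as fh:
        yield from csv.DictReader(fh)


# Tiny stand-in archive; real archives keep the upstream columns and order.
sample = gzip.compress(b"NPI,STATE\n1234567890,TX\n1098765432,CA\n")
rows = list(iter_rows(sample))
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

Column order is preserved because DictReader builds each row from the header line as written.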

Phase 2 format alternatives queued separately: Parquet (§sprint3-bulk-export-parquet), JSON Lines (§sprint3-bulk-export-jsonl), FHIR Bundle (§sprint3-bulk-export-fhir-bundle), partitioned per-state / per-vertical (§sprint3-bulk-export-partitioned).

4. Response headers

14-tuple provenance + hash-match cross-link.

Every 302 redirect carries:

  • X-Fonteum-Source — source_id of the dataset
  • X-Fonteum-Snapshot-Date — ISO-8601 date of the snapshot
  • X-Fonteum-SHA256 — 64-char lowercase hex; matches snapshot_attestations.content_hash
  • X-Fonteum-License — SPDX identifier (e.g. US-Government-Works, CC-BY-4.0)
  • X-Fonteum-Cite — citation format URL (/cite)
  • X-Fonteum-Verify — hash-match endpoint (/verify/<snapshot-id>)
  • Link — sidecar manifest URL with rel="describedby"
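
A consumer may want to validate these headers before trusting them. The following is a minimal sketch, assuming the header names listed above; the Provenance class, the parse_provenance helper, and every sample value (including the snapshot id) are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Mapping

# "64-char lowercase hex", per the header description above.
SHA256_RE = re.compile(r"[0-9a-f]{64}")


@dataclass(frozen=True)
class Provenance:
    source: str
    snapshot_date: date
    sha256: str
    license: str
    cite: str
    verify: str


def parse_provenance(headers: Mapping[str, str]) -> Provenance:
    """Build a Provenance record from X-Fonteum-* headers (names matched case-insensitively)."""
    h = {k.lower(): v.strip() for k, v in headers.items()}
    sha = h["x-fonteum-sha256"]
    if not SHA256_RE.fullmatch(sha):
        raise ValueError(f"not a 64-char lowercase hex digest: {sha!r}")
    return Provenance(
        source=h["x-fonteum-source"],
        snapshot_date=date.fromisoformat(h["x-fonteum-snapshot-date"]),
        sha256=sha,
        license=h["x-fonteum-license"],
        cite=h["x-fonteum-cite"],
        verify=h["x-fonteum-verify"],
    )


# Invented sample values, for illustration only.
sample_headers = {
    "X-Fonteum-Source": "cms-pecos",
    "X-Fonteum-Snapshot-Date": "2026-05-03",
    "X-Fonteum-SHA256": "ab" * 32,
    "X-Fonteum-License": "US-Government-Works",
    "X-Fonteum-Cite": "/cite",
    "X-Fonteum-Verify": "/verify/example-snapshot-id",
}
prov = parse_provenance(sample_headers)
print(prov.source, prov.snapshot_date, prov.sha256[:8])
```

A missing header raises KeyError and a malformed digest raises ValueError, so a bad response fails before any hash comparison is attempted.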

5. Hash-match flow

Hash-match against the snapshot attestation.

The bulk surface is content-addressable: every snapshot has one SHA-256, that hash is attested in snapshot_attestations at ingestion time, and the response header, the sidecar manifest, and the /verify/[snapshot-id] endpoint all return the same value. A single consumer therefore has three independent hash-match paths (defense in depth).

# 1. Read the manifest to find the latest snapshot date + SHA-256
curl -s https://fonteum.com/api/v1/bulk/cms-pecos/manifest.json \
  | jq '.snapshots[0] | {snapshot_date, sha256, cache_url}'

# 2. Download the gzipped CSV (-L follows the 302 redirect to S3)
curl -L -o pecos-latest.csv.gz \
  https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz

# 3. Recompute the SHA-256 locally and compare to the header value.
#    -I without -L reads the headers of the 302 itself, which carries X-Fonteum-*.
EXPECTED=$(curl -sI https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz \
  | awk -F': ' 'tolower($1)=="x-fonteum-sha256" {print tolower($2)}' \
  | tr -d '\r')
# shasum ships with macOS; on most Linux systems sha256sum prints the same format
ACTUAL=$(shasum -a 256 pecos-latest.csv.gz | awk '{print $1}')
# The -n guard keeps a missing header from counting as a match
[ -n "$EXPECTED" ] && [ "$EXPECTED" = "$ACTUAL" ] && echo "ok" || echo "MISMATCH"

# 4. Cross-check against the /verify endpoint (defense in depth)
SNAPSHOT_ID=$(curl -sI https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz \
  | awk -F'/' 'tolower($1)~/x-fonteum-verify/ {print $NF}' | tr -d '\r')
curl -s -H 'Accept: text/plain' "https://fonteum.com/verify/$SNAPSHOT_ID"
# returns the 64-char hex hash; should equal both $EXPECTED and $ACTUAL
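
The shell steps above compare the local digest with the header and print the /verify value for inspection. A sketch of checking all three published values at once, assuming they have already been fetched as strings, could look like the following; the function name mismatched_paths and the path labels are hypothetical.

```python
import hashlib


def mismatched_paths(payload: bytes, published: dict[str, str]) -> list[str]:
    """Return the names of published hashes that differ from the local SHA-256."""
    local = hashlib.sha256(payload).hexdigest()
    # Normalise whitespace and case so a trailing CR/LF does not cause a false alarm.
    return [name for name, value in published.items() if value.strip().lower() != local]


# Stand-in payload; in practice this is the downloaded .csv.gz file's bytes.
payload = b"example archive bytes"
good = hashlib.sha256(payload).hexdigest()

all_agree = mismatched_paths(
    payload, {"header": good, "manifest": good, "verify": good + "\r\n"}
)
one_off = mismatched_paths(
    payload, {"header": good, "manifest": "0" * 64, "verify": good}
)
print("all agree:", all_agree)
print("one off:", one_off)
```

Returning the names of the disagreeing paths, rather than a single boolean, shows which of the three sources is out of step.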

6. Python (3.10+ stdlib)

urllib + gzip + hashlib.

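# Python 3.10+ (stdlib only: urllib + gzip + hashlib + csv)
import csv
import gzip
import hashlib
import io
import urllib.request

URL = "https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz"


class CaptureRedirect(urllib.request.HTTPRedirectHandler):
    """Follow the 302 as usual, but remember the headers of the first redirect."""

    redirect_headers = None

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if self.redirect_headers is None:
            self.redirect_headers = headers
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# urllib follows the 302 automatically, so resp.headers describes the final (S3)
# response. The X-Fonteum-* headers are sent on the 302 itself, so read that hop.
capture = CaptureRedirect()
with urllib.request.build_opener(capture).open(URL) as resp:
    raw = resp.read()
    final_headers = resp.headers


def header(name):
    """Prefer the 302's header; fall back to the final response if it echoes it."""
    first = capture.redirect_headers
    return (first.get(name) if first is not None else None) or final_headers.get(name, "")


expected_sha = header("X-Fonteum-SHA256").lower()
snapshot_date = header("X-Fonteum-Snapshot-Date")

# Verify the hash matches what Fonteum attested (a missing header is a failure)
actual_sha = hashlib.sha256(raw).hexdigest()
if not expected_sha or actual_sha != expected_sha:
    raise ValueError(f"SHA mismatch: {actual_sha} != {expected_sha!r}")

# Decompress + iterate
with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8", newline="") as fh:
    for row in csv.DictReader(fh):
        # ... your analysis here ...
        pass

print(f"Hash-matched snapshot {snapshot_date} ({len(raw):,} bytes gzipped)")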

7. R (4.0+)

readr + digest + httr.

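# R 4.0+ (readr + digest + httr)
library(readr)
library(digest)
library(httr)

url <- "https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz"

# httr::GET follows the 302 by default; headers(res) then describes the final
# (S3) response, so search every hop for the X-Fonteum-SHA256 header.
res <- GET(url)
stop_for_status(res)
sha_values <- unlist(lapply(res$all_headers, function(hop) hop$headers[["x-fonteum-sha256"]]))
stopifnot(length(sha_values) >= 1)  # a missing header must fail, not pass silently
expected_sha <- tolower(sha_values[[1]])
gz_bytes <- content(res, "raw")

# Recompute SHA-256 over the gzipped bytes (the same bytes the header describes)
actual_sha <- digest(gz_bytes, algo = "sha256", serialize = FALSE)
stopifnot(identical(expected_sha, actual_sha))

# readr decompresses .gz files by extension, so write the bytes to a temp file first
tmp <- tempfile(fileext = ".csv.gz")
writeBin(gz_bytes, tmp)
df <- read_csv(tmp)
message(sprintf("Hash-matched snapshot: %d rows, %d cols", nrow(df), ncol(df)))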

8. Node.js (18+)

stdlib fetch + crypto + zlib.

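// Node 18+ (built-in fetch + crypto + zlib); run as an ES module for top-level await
import { createHash } from "node:crypto";
import { gunzipSync } from "node:zlib";

// Named SNAPSHOT_URL so the global URL class stays usable below
const SNAPSHOT_URL = "https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz";

// Handle the redirect manually: the X-Fonteum-* headers are sent on the 302 itself,
// and with redirect: "follow" only the final (S3) response headers are visible.
const hop = await fetch(SNAPSHOT_URL, { redirect: "manual" });
const location = hop.headers.get("location");
const res = location ? await fetch(new URL(location, SNAPSHOT_URL)) : hop;
if (!res.ok) throw new Error(`HTTP ${res.status}`);

const expected = (
  hop.headers.get("x-fonteum-sha256") ?? res.headers.get("x-fonteum-sha256")
)?.toLowerCase();
if (!expected) throw new Error("X-Fonteum-SHA256 header missing");
const buf = Buffer.from(await res.arrayBuffer());

const actual = createHash("sha256").update(buf).digest("hex");
if (actual !== expected) throw new Error(`sha mismatch ${actual} != ${expected}`);

const csv = gunzipSync(buf).toString("utf-8");
// Counts physical lines, header included; quoted newlines need a real CSV parser
console.log(`Hash-matched ${csv.trimEnd().split("\n").length} lines`);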

9. License + citation

Per-source SPDX. Cite Fonteum when used in publications.

Federal sources (CMS, OIG, HRSA, BLS, BEA) are US-Government-Works — public domain, redistribution allowed. Fonteum-derived datasets carry CC-BY-4.0 requiring attribution. The X-Fonteum-License header surfaces the SPDX value on every response.

For papers, theses, dashboards, or commercial products: cite Fonteum per /cite (APA + AMA + BibTeX). Pin the snapshot_date when reproducibility matters; a dated URL is immutable.
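
Snapshots are kept on a 90-day rolling window, so a pinned date can age out even though its URL never changes while it exists. The sketch below checks a pinned date against a per-source manifest. It assumes the manifest exposes a snapshots array whose rows carry snapshot_date and retention_expires, and that retention_expires is an ISO-8601 timestamp. The helper names and all sample values are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def find_snapshot(manifest: dict, snapshot_date: str) -> Optional[dict]:
    """Return the manifest row for a given snapshot date, or None if it is not listed."""
    for row in manifest.get("snapshots", []):
        if row.get("snapshot_date") == snapshot_date:
            return row
    return None


def is_retained(row: dict, now: datetime) -> bool:
    """True while the row's retention_expires timestamp is still in the future."""
    return now < datetime.fromisoformat(row["retention_expires"])


# Invented manifest fragment, for illustration only.
manifest = {
    "snapshots": [
        {"snapshot_date": "2026-05-03", "retention_expires": "2026-08-01T00:00:00+00:00"},
        {"snapshot_date": "2026-02-01", "retention_expires": "2026-05-02T00:00:00+00:00"},
    ]
}
now = datetime(2026, 5, 8, tzinfo=timezone.utc)

for wanted in ("2026-05-03", "2026-02-01", "2025-12-07"):
    row = find_snapshot(manifest, wanted)
    if row is None:
        status = "not listed"
    else:
        status = "retained" if is_retained(row, now) else "expired"
    print(wanted, status)
```

For work that must stay reproducible past the retention window, keeping a local copy of the archive together with its SHA-256 is the safer option.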

Phase roadmap

Phase 1 ships gzipped CSV. Phase 2 — formats + partitions.

  • Phase 1 (this wave): registered source families × (latest + dated + manifest) endpoints + top-level manifest + /data-catalog distribution surfacing.
  • §sprint3-bulk-export-parquet (queued): Parquet format alternative — same URLs with .parquet suffix.
  • §sprint3-bulk-export-jsonl (queued): JSON Lines format — .jsonl.gz.
  • §sprint3-bulk-export-partitioned (queued): per-state + per-vertical splits.
  • §sprint3-bulk-export-fhir-bundle (queued): FHIR Bundle format for FHIR-aligned consumers.
  • §sprint3-datacite-bulk-listing (queued): DataCite metadata harvester registration so federal catalogs auto-discover the bulk surface.

© 2026 FONTEUM RESEARCH · DATA SNAPSHOT MAY 8, 2026 · BUILT WITH CARE