Competition: Medicaid Provider Fraud Signal Detection Engine — 1000N Prize Pool
OBJECTIVE
Build a tool that identifies Medicaid providers with the strongest fraud signals by cross-referencing the HHS Medicaid Provider Spending dataset against public provider registries. Output must be structured JSON that a qui tam / False Claims Act lawyer can use to evaluate whether to file a case.
You may use any approach, any math, any algorithm. We do not define HOW you detect fraud. We define WHAT fraud looks like, what data you have, and how we measure the quality of your results.
INPUT DATA (3 files, all free public downloads)
File 1: HHS Medicaid Provider Spending
- URL: https://stopendataprod.blob.core.windows.net/datasets/medicaid-provider-spending/2026-02-09/medicaid-provider-spending.parquet (2.9 GB parquet)
- 227 million rows, January 2018 through December 2024
- Columns:

| Column | Type | Description |
|---|---|---|
| BILLING_PROVIDER_NPI_NUM | string | 10-digit NPI of billing provider |
| SERVICING_PROVIDER_NPI_NUM | string | 10-digit NPI of servicing provider |
| HCPCS_CODE | string | Procedure code billed |
| CLAIM_FROM_MONTH | date | YYYY-MM-01 |
| TOTAL_UNIQUE_BENEFICIARIES | int | Unique patients |
| TOTAL_CLAIMS | int | Number of claims (minimum 12 per row) |
| TOTAL_PAID | float | USD paid by Medicaid |
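The output schema below asks for per-provider lifetime totals. A minimal sketch of that rollup over in-memory rows is shown here; on the real 2.9 GB parquet you would push the same GROUP BY into a columnar engine such as DuckDB or Polars rather than loop in Python. The sample rows are invented for illustration:

```python
from collections import defaultdict

# Each dict mirrors one (simplified) record of the spending parquet.
rows = [
    {"BILLING_PROVIDER_NPI_NUM": "1234567890", "TOTAL_CLAIMS": 40,
     "TOTAL_UNIQUE_BENEFICIARIES": 12, "TOTAL_PAID": 5000.0},
    {"BILLING_PROVIDER_NPI_NUM": "1234567890", "TOTAL_CLAIMS": 60,
     "TOTAL_UNIQUE_BENEFICIARIES": 15, "TOTAL_PAID": 7500.0},
    {"BILLING_PROVIDER_NPI_NUM": "9876543210", "TOTAL_CLAIMS": 12,
     "TOTAL_UNIQUE_BENEFICIARIES": 12, "TOTAL_PAID": 900.0},
]

totals = defaultdict(lambda: {"claims": 0, "beneficiaries": 0, "paid": 0.0})
for row in rows:
    npi = row["BILLING_PROVIDER_NPI_NUM"]
    totals[npi]["claims"] += row["TOTAL_CLAIMS"]
    # Caveat: the parquet only carries MONTHLY unique counts, so summing them
    # overstates all-time unique beneficiaries; document this in your methodology.
    totals[npi]["beneficiaries"] += row["TOTAL_UNIQUE_BENEFICIARIES"]
    totals[npi]["paid"] += row["TOTAL_PAID"]

print(totals["1234567890"])  # {'claims': 100, 'beneficiaries': 27, 'paid': 12500.0}
```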
File 2: OIG LEIE Exclusion List
- URL: https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv
- Every provider currently excluded from federal healthcare programs
- Key columns:
LASTNAME, FIRSTNAME, BUSNAME, NPI (10 chars, often empty), STATE (2 chars), EXCLTYPE, EXCLDATE, REINDATE
File 3: NPPES NPI Registry
- URL: https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2026_V2.zip (~1 GB zip)
- Every registered healthcare provider in the US (329 columns)
- Key columns:
NPI,Entity Type Code(1=Individual, 2=Organization),Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Business Practice Location Address State Name,Provider Business Practice Location Address Postal Code,Healthcare Provider Taxonomy Code_1,Provider Enumeration Date,Authorized Official Last Name,Authorized Official First Name
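Since NPPES ships 329 columns, loading only the handful you need keeps the MacBook fallback inside 16 GB. A sketch using the stdlib csv module over a tiny invented extract (the real pipeline would stream the full CSV the same way, or use a reader with column selection such as pandas `usecols`):

```python
import csv
import io

# Only the columns this tool actually consumes (names match the real header).
KEEP = [
    "NPI",
    "Entity Type Code",
    "Provider Organization Name (Legal Business Name)",
    "Provider Business Practice Location Address State Name",
]

# Hypothetical two-line extract standing in for the ~1 GB file.
sample = io.StringIO(
    '"NPI","Entity Type Code","Provider Organization Name (Legal Business Name)",'
    '"Provider Business Practice Location Address State Name"\n'
    '"1234567890","2","ACME HOME HEALTH LLC","FL"\n'
)

registry = {}
for row in csv.DictReader(sample):
    registry[row["NPI"]] = {k: row[k] for k in KEEP}  # drop the other 325 columns

print(registry["1234567890"]["Entity Type Code"])  # 2
```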
You may use additional free public datasets if they improve your results. Document any additional data sources in your README.
WHAT FRAUD LOOKS LIKE (reference patterns, not implementation requirements)
These are the known fraud patterns in Medicaid billing. Your tool should detect AS MANY of these as possible, PLUS any novel patterns you discover. You are not limited to this list.
- Excluded providers still billing: Provider on the OIG exclusion list received Medicaid payments after their exclusion date. This is automatic fraud — they are legally barred from billing.
- Statistical billing outliers: Provider's total Medicaid payments are far above peers in the same specialty and state. The further above, the stronger the signal.
- Bust-out schemes: Newly registered provider entity ramps billing extremely fast in its first months, then potentially disappears. Classic pattern for fraudulent shell companies.
- Impossible service volume: Organization billing more claims per month than is physically possible for any reasonable number of employees to deliver. Example: a 5-person clinic billing 10,000 claims/month.
- Shell entity networks: Same authorized official or same address controlling many NPIs, with billing spread across them. Used to stay under the radar on any single entity.
- Home health billing abuse: Home health HCPCS codes (G0151-G0162, G0299-G0300, S9122-S9124, T1019-T1022) with very low unique-patients-to-claims ratios, indicating repeated billing on the same patients. Home health is the #1 Medicaid fraud category.
- Geographic anomalies: Provider's registered address doesn't match the population density or demographics needed to support their billing volume.
- Anything else you find: Novel signals, cross-correlations, network analysis, temporal patterns, HCPCS code anomalies — if you can define it precisely and it identifies suspicious billing, it counts.
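The first pattern reduces to a date comparison after joining on NPI. A minimal sketch over synthetic rows (real code must additionally parse LEIE's date strings, handle rows with empty NPI fields, and honor REINDATE):

```python
from datetime import date

# Hypothetical LEIE extract: NPI -> (exclusion date, reinstatement date or None).
exclusions = {"1234567890": (date(2020, 6, 1), None)}

# Hypothetical spending rows: (billing NPI, CLAIM_FROM_MONTH, TOTAL_PAID).
claims = [
    ("1234567890", date(2020, 3, 1), 4000.0),  # before exclusion: clean
    ("1234567890", date(2021, 1, 1), 9000.0),  # after exclusion: automatic fraud
    ("5555555555", date(2021, 1, 1), 1000.0),  # never excluded
]

flags = []
for npi, month, paid in claims:
    if npi in exclusions:
        excl_date, rein_date = exclusions[npi]
        # Payment on or after exclusion, and before any reinstatement, is critical.
        if month >= excl_date and (rein_date is None or month < rein_date):
            flags.append({"npi": npi, "month": month.isoformat(),
                          "paid_while_excluded": paid, "severity": "critical"})

print(flags)
```

Because many LEIE rows have an empty NPI, a name-plus-state fuzzy match against NPPES can recover additional excluded providers; flag those separately, since the evidence is weaker than an exact NPI hit.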
REQUIRED OUTPUT FORMAT
Single file: fraud_signals.json
{
"generated_at": "ISO-8601 timestamp",
"tool_version": "string",
"data_sources_used": ["list of URLs or file names ingested"],
"methodology_summary": "1-2 paragraph description of your approach",
"total_providers_scanned": 0,
"total_providers_flagged": 0,
"total_estimated_overpayment_usd": 0.00,
"flagged_providers": [
{
"npi": "1234567890",
"provider_name": "string from NPPES",
"entity_type": "individual|organization",
"taxonomy_code": "string",
"state": "XX",
"enumeration_date": "YYYY-MM-DD or null",
"total_paid_all_time": 0.00,
"total_claims_all_time": 0,
"total_unique_beneficiaries_all_time": 0,
"signals": [
{
"signal_name": "string — your name for this signal",
"signal_description": "1 sentence plain English explanation of what was detected",
"severity": "critical|high|medium",
"evidence": {
"// signal-specific key-value pairs proving the flag": "values"
},
"estimated_overpayment_usd": 0.00,
"overpayment_methodology": "1 sentence explaining how you calculated the estimate"
}
],
"combined_estimated_overpayment_usd": 0.00,
"fca_relevance": {
"violation_description": "Plain English: what this provider appears to have done wrong",
"statute_reference": "31 U.S.C. section 3729(a)(1)(A|B|C|G) — pick the most applicable",
"estimated_government_loss": 0.00,
"suggested_investigation_steps": [
"Specific, actionable step a lawyer/investigator could take"
]
}
}
]
}
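A sketch of assembling the top-level skeleton and sanity-checking it before writing to disk; the key names come from the schema above, while the helper `valid` and the placeholder values are illustrative only:

```python
import json
from datetime import datetime, timezone

report = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "tool_version": "0.1.0",
    "data_sources_used": ["medicaid-provider-spending.parquet"],
    "methodology_summary": "Placeholder summary.",
    "total_providers_scanned": 0,
    "total_providers_flagged": 0,
    "total_estimated_overpayment_usd": 0.0,
    "flagged_providers": [],
}

REQUIRED = set(report)  # the top-level keys the schema check will look for

def valid(doc: dict) -> bool:
    """Cheap pre-flight check: all required keys present, flags list is a list."""
    return REQUIRED <= set(doc) and isinstance(doc.get("flagged_providers"), list)

json.dumps(report)   # must serialize cleanly before run.sh writes the file
print(valid(report))  # True
```

Running this check at the end of `run.sh` protects the 5 schema points against a malformed write.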
Severity rules:
- critical: Provider is on the OIG exclusion list and received payments after exclusion. No discretion — this is automatic.
- high: Estimated overpayment exceeds $500,000, OR multiple signals triggered on the same provider.
- medium: Single signal, estimated overpayment under $500,000.
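These rules can be encoded as a single ordered function. Note the rules mix a per-signal criterion (overpayment amount) with per-provider criteria (exclusion status, signal count); this sketch treats them as a provider-level rollup, which is one reasonable reading:

```python
def severity(excluded_after_exclusion: bool,
             estimated_overpayment_usd: float,
             n_signals: int) -> str:
    """Apply the competition's severity rules in order of precedence."""
    if excluded_after_exclusion:
        return "critical"  # automatic, no discretion
    if estimated_overpayment_usd > 500_000 or n_signals > 1:
        return "high"
    return "medium"

print(severity(False, 750_000.0, 1))  # high
```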
Every flagged provider MUST have:
- All top-level fields populated (non-null except enumeration_date which may be null)
- At least 1 signal with evidence
- fca_relevance with a specific violation_description (not generic), correct statute_reference, and at least 2 suggested_investigation_steps that reference THIS provider's specific data (address, NPI, billing codes, dates — not boilerplate)
HARDWARE REQUIREMENTS
Primary target: Single Linux machine, up to 200GB RAM, up to 1 GPU (H100/H200)
Required fallback: Must also run on a MacBook (Apple Silicon, 16GB+ RAM, no GPU). Can be slower but must complete without error.
Constraints:
- No distributed computing (no Spark clusters, no multi-node anything)
- No proprietary cloud services (no BigQuery, Snowflake, Databricks)
- If GPU is used, must have a --no-gpu flag that works on Linux and macOS
- All dependencies pip-installable from public repos
- Must work on Ubuntu 22.04+ and macOS 14+ with Python 3.11+
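One way to satisfy the --no-gpu requirement, sketched with stdlib argparse; the flag name is mandated above, but everything else (parser description, how the flag is consumed) is illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fraud signal detection engine")
    # store_true: flag is False unless --no-gpu is passed, so CPU-only mode
    # is an explicit opt-in on both Linux and macOS.
    parser.add_argument("--no-gpu", action="store_true",
                        help="Force CPU-only execution (required fallback path)")
    return parser

args = build_parser().parse_args(["--no-gpu"])
print(args.no_gpu)  # True
```

Downstream code should branch once on `args.no_gpu` (and on whether a GPU is actually present) rather than scattering device checks through the pipeline.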
Performance targets:
- Linux (200GB RAM, GPU): under 30 minutes
- Linux (64GB RAM, no GPU): under 60 minutes
- MacBook (16GB RAM, no GPU): under 4 hours
DELIVERABLE
A GitHub repo or zip containing:
- README.md with setup and run instructions
- requirements.txt (pip installable)
- setup.sh — downloads data, installs deps
- run.sh — produces fraud_signals.json
- Source code (any structure)
- fraud_signals.json — sample output from an actual run on the real data
JUDGING CRITERIA (100 points, scored by AI judge)
Result Quality (70 points)
- Does fraud_signals.json contain valid JSON matching the schema above? (5 pts)
- Are excluded-provider flags present and correct? Cross-check NPIs against the LEIE — every flagged excluded provider must actually appear in the LEIE with matching dates. (10 pts)
- For each non-excluded-provider signal: is the evidence internally consistent? Do the numbers in evidence support the signal_description? Would a skeptical lawyer find the evidence convincing? (20 pts)
- Estimated overpayment amounts: are they calculated from the actual billing data (not made up)? Is the methodology explained and reproducible? (10 pts)
- fca_relevance quality: does violation_description describe what THIS specific provider did (not generic text)? Do suggested_investigation_steps reference this provider's actual NPI, address, billing codes, or dates? (15 pts)
- Signal diversity: how many distinct fraud patterns does the tool detect? Minimum 3 required for any points. More patterns with real evidence = more points. (10 pts)
Execution (20 points)
- setup.sh runs successfully on Ubuntu 22.04, Python 3.11+ (5 pts)
- setup.sh runs successfully on macOS 14+ Apple Silicon, Python 3.11+ (5 pts)
- run.sh produces output without crashing (5 pts)
- Completes in under 60 minutes on 64GB RAM Linux, no GPU (5 pts)
Novelty (10 points)
- Does the tool find fraud patterns BEYOND the 7 reference patterns listed above? Novel signals with real evidence from the data score up to the full 10 points. (10 pts)
WHAT IS NOT SCORED
- Code style, comments, architecture, test coverage — not scored
- Number of providers flagged — not scored. 50 high-quality flags beats 5,000 garbage flags.
- UI, dashboards, visualizations — not scored
- How you got there — only the quality of results matters