Competition: Medicaid Provider Fraud Signal Detection Engine — 1000N Prize Pool
OBJECTIVE
Build a tool that identifies Medicaid providers with the strongest fraud signals by cross-referencing the HHS Medicaid Provider Spending dataset against public provider registries. Output must be structured JSON that a qui tam / False Claims Act lawyer can use to evaluate whether to file a case.
You may use any approach, any math, any algorithm. We do not define HOW you detect fraud. We define WHAT fraud looks like, what data you have, and how we measure the quality of your results.
INPUT DATA (3 files, all free public downloads)
File 1: HHS Medicaid Provider Spending
- URL: https://stopendataprod.blob.core.windows.net/datasets/medicaid-provider-spending/2026-02-09/medicaid-provider-spending.parquet (2.9 GB parquet)
- 227 million rows, January 2018 through December 2024
- Columns:

| Column | Type | Description |
|---|---|---|
| BILLING_PROVIDER_NPI_NUM | string | 10-digit NPI of billing provider |
| SERVICING_PROVIDER_NPI_NUM | string | 10-digit NPI of servicing provider |
| HCPCS_CODE | string | Procedure code billed |
| CLAIM_FROM_MONTH | date | YYYY-MM-01 |
| TOTAL_UNIQUE_BENEFICIARIES | int | Unique patients |
| TOTAL_CLAIMS | int | Number of claims (minimum 12 per row) |
| TOTAL_PAID | float | USD paid by Medicaid |
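The output schema below asks for per-provider lifetime totals. A minimal sketch of that rollup over in-memory rows is shown here; on the real 2.9 GB parquet you would push the same GROUP BY into a columnar engine such as DuckDB or Polars rather than loop in Python. The sample rows are invented for illustration:

```python
from collections import defaultdict

# Each dict mirrors one (simplified) record of the spending parquet.
rows = [
    {"BILLING_PROVIDER_NPI_NUM": "1234567890", "TOTAL_CLAIMS": 40,
     "TOTAL_UNIQUE_BENEFICIARIES": 12, "TOTAL_PAID": 5000.0},
    {"BILLING_PROVIDER_NPI_NUM": "1234567890", "TOTAL_CLAIMS": 60,
     "TOTAL_UNIQUE_BENEFICIARIES": 15, "TOTAL_PAID": 7500.0},
    {"BILLING_PROVIDER_NPI_NUM": "9876543210", "TOTAL_CLAIMS": 12,
     "TOTAL_UNIQUE_BENEFICIARIES": 12, "TOTAL_PAID": 900.0},
]

totals = defaultdict(lambda: {"claims": 0, "beneficiaries": 0, "paid": 0.0})
for row in rows:
    npi = row["BILLING_PROVIDER_NPI_NUM"]
    totals[npi]["claims"] += row["TOTAL_CLAIMS"]
    # Caveat: the parquet only carries MONTHLY unique counts, so summing them
    # overstates all-time unique beneficiaries; document this in your methodology.
    totals[npi]["beneficiaries"] += row["TOTAL_UNIQUE_BENEFICIARIES"]
    totals[npi]["paid"] += row["TOTAL_PAID"]

print(totals["1234567890"])  # {'claims': 100, 'beneficiaries': 27, 'paid': 12500.0}
```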
File 2: OIG LEIE Exclusion List
- URL: https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv
- Every provider currently excluded from federal healthcare programs
- Key columns:
LASTNAME, FIRSTNAME, BUSNAME, NPI (10 chars, often empty), STATE (2 chars), EXCLTYPE, EXCLDATE, REINDATE
File 3: NPPES NPI Registry
- URL: https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2026_V2.zip (~1 GB zip)
- Every registered healthcare provider in the US (329 columns)
- Key columns:
NPI,Entity Type Code(1=Individual, 2=Organization),Provider Organization Name (Legal Business Name),Provider Last Name (Legal Name),Provider First Name,Provider Business Practice Location Address State Name,Provider Business Practice Location Address Postal Code,Healthcare Provider Taxonomy Code_1,Provider Enumeration Date,Authorized Official Last Name,Authorized Official First Name
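Since NPPES ships 329 columns, loading only the handful you need keeps the MacBook fallback inside 16 GB. A sketch using the stdlib csv module over a tiny invented extract (the real pipeline would stream the full CSV the same way, or use a reader with column selection such as pandas `usecols`):

```python
import csv
import io

# Only the columns this tool actually consumes (names match the real header).
KEEP = [
    "NPI",
    "Entity Type Code",
    "Provider Organization Name (Legal Business Name)",
    "Provider Business Practice Location Address State Name",
]

# Hypothetical two-line extract standing in for the ~1 GB file.
sample = io.StringIO(
    '"NPI","Entity Type Code","Provider Organization Name (Legal Business Name)",'
    '"Provider Business Practice Location Address State Name"\n'
    '"1234567890","2","ACME HOME HEALTH LLC","FL"\n'
)

registry = {}
for row in csv.DictReader(sample):
    registry[row["NPI"]] = {k: row[k] for k in KEEP}  # drop the other 325 columns

print(registry["1234567890"]["Entity Type Code"])  # 2
```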
You may use additional free public datasets if they improve your results. Document any additional data sources in your README.
WHAT FRAUD LOOKS LIKE (reference patterns, not implementation requirements)
These are the known fraud patterns in Medicaid billing. Your tool should detect AS MANY of these as possible, PLUS any novel patterns you discover. You are not limited to this list.
- Excluded providers still billing: Provider on the OIG exclusion list received Medicaid payments after their exclusion date. This is automatic fraud — they are legally barred from billing.
- Statistical billing outliers: Provider's total Medicaid payments are far above peers in the same specialty and state. The further above, the stronger the signal.
- Bust-out schemes: Newly registered provider entity ramps billing extremely fast in its first months, then potentially disappears. Classic pattern for fraudulent shell companies.
- Impossible service volume: Organization billing more claims per month than is physically possible for any reasonable number of employees to deliver. Example: a 5-person clinic billing 10,000 claims/month.
- Shell entity networks: Same authorized official or same address controlling many NPIs, with billing spread across them. Used to stay under the radar on any single entity.
- Home health billing abuse: Home health HCPCS codes (G0151-G0162, G0299-G0300, S9122-S9124, T1019-T1022) with very low unique-patients-to-claims ratios, indicating repeated billing on the same patients. Home health is the #1 Medicaid fraud category.
- Geographic anomalies: Provider's registered address doesn't match the population density or demographics needed to support their billing volume.
- Anything else you find: Novel signals, cross-correlations, network analysis, temporal patterns, HCPCS code anomalies — if you can define it precisely and it identifies suspicious billing, it counts.
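The first pattern reduces to a date comparison after joining on NPI. A minimal sketch over synthetic rows (real code must additionally parse LEIE's date strings, handle rows with empty NPI fields, and honor REINDATE):

```python
from datetime import date

# Hypothetical LEIE extract: NPI -> (exclusion date, reinstatement date or None).
exclusions = {"1234567890": (date(2020, 6, 1), None)}

# Hypothetical spending rows: (billing NPI, CLAIM_FROM_MONTH, TOTAL_PAID).
claims = [
    ("1234567890", date(2020, 3, 1), 4000.0),  # before exclusion: clean
    ("1234567890", date(2021, 1, 1), 9000.0),  # after exclusion: automatic fraud
    ("5555555555", date(2021, 1, 1), 1000.0),  # never excluded
]

flags = []
for npi, month, paid in claims:
    if npi in exclusions:
        excl_date, rein_date = exclusions[npi]
        # Payment on or after exclusion, and before any reinstatement, is critical.
        if month >= excl_date and (rein_date is None or month < rein_date):
            flags.append({"npi": npi, "month": month.isoformat(),
                          "paid_while_excluded": paid, "severity": "critical"})

print(flags)
```

Because many LEIE rows have an empty NPI, a name-plus-state fuzzy match against NPPES can recover additional excluded providers; flag those separately, since the evidence is weaker than an exact NPI hit.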
REQUIRED OUTPUT FORMAT
Single file: fraud_signals.json
{
"generated_at": "ISO-8601 timestamp",
"tool_version": "string",
"data_sources_used": ["list of URLs or file names ingested"],
"methodology_summary": "1-2 paragraph description of your approach",
"total_providers_scanned": 0,
"total_providers_flagged": 0,
"total_estimated_overpayment_usd": 0.00,
"flagged_providers": [
{
"npi": "1234567890",
"provider_name": "string from NPPES",
"entity_type": "individual|organization",
"taxonomy_code": "string",
"state": "XX",
"enumeration_date": "YYYY-MM-DD or null",
"total_paid_all_time": 0.00,
"total_claims_all_time": 0,
"total_unique_beneficiaries_all_time": 0,
"signals": [
{
"signal_name": "string — your name for this signal",
"signal_description": "1 sentence plain English explanation of what was detected",
"severity": "critical|high|medium",
"evidence": {
"// signal-specific key-value pairs proving the flag": "values"
},
"estimated_overpayment_usd": 0.00,
"overpayment_methodology": "1 sentence explaining how you calculated the estimate"
}
],
"combined_estimated_overpayment_usd": 0.00,
"fca_relevance": {
"violation_description": "Plain English: what this provider appears to have done wrong",
"statute_reference": "31 U.S.C. section 3729(a)(1)(A|B|C|G) — pick the most applicable",
"estimated_government_loss": 0.00,
"suggested_investigation_steps": [
"Specific, actionable step a lawyer/investigator could take"
]
}
}
]
}
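A sketch of assembling the top-level skeleton and sanity-checking it before writing to disk; the key names come from the schema above, while the helper `valid` and the placeholder values are illustrative only:

```python
import json
from datetime import datetime, timezone

report = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "tool_version": "0.1.0",
    "data_sources_used": ["medicaid-provider-spending.parquet"],
    "methodology_summary": "Placeholder summary.",
    "total_providers_scanned": 0,
    "total_providers_flagged": 0,
    "total_estimated_overpayment_usd": 0.0,
    "flagged_providers": [],
}

REQUIRED = set(report)  # the top-level keys the schema check will look for

def valid(doc: dict) -> bool:
    """Cheap pre-flight check: all required keys present, flags list is a list."""
    return REQUIRED <= set(doc) and isinstance(doc.get("flagged_providers"), list)

json.dumps(report)   # must serialize cleanly before run.sh writes the file
print(valid(report))  # True
```

Running this check at the end of `run.sh` protects the 5 schema points against a malformed write.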
Severity rules:
- critical: Provider is on the OIG exclusion list and received payments after exclusion. No discretion — this is automatic.
- high: Estimated overpayment exceeds $500,000, OR multiple signals triggered on the same provider.
- medium: Single signal, estimated overpayment under $500,000.
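These rules can be encoded as a single ordered function. Note the rules mix a per-signal criterion (overpayment amount) with per-provider criteria (exclusion status, signal count); this sketch treats them as a provider-level rollup, which is one reasonable reading:

```python
def severity(excluded_after_exclusion: bool,
             estimated_overpayment_usd: float,
             n_signals: int) -> str:
    """Apply the competition's severity rules in order of precedence."""
    if excluded_after_exclusion:
        return "critical"  # automatic, no discretion
    if estimated_overpayment_usd > 500_000 or n_signals > 1:
        return "high"
    return "medium"

print(severity(False, 750_000.0, 1))  # high
```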
Every flagged provider MUST have:
- All top-level fields populated (non-null except enumeration_date which may be null)
- At least 1 signal with evidence
- fca_relevance with a specific violation_description (not generic), correct statute_reference, and at least 2 suggested_investigation_steps that reference THIS provider's specific data (address, NPI, billing codes, dates — not boilerplate)
HARDWARE REQUIREMENTS
Primary target: Single Linux machine, up to 200GB RAM, up to 1 GPU (H100/H200)
Required fallback: Must also run on a MacBook (Apple Silicon, 16GB+ RAM, no GPU). Can be slower but must complete without error.
Constraints:
- No distributed computing (no Spark clusters, no multi-node anything)
- No proprietary cloud services (no BigQuery, Snowflake, Databricks)
- If GPU is used, must have a --no-gpu flag that works on Linux and macOS
- All dependencies pip-installable from public repos
- Must work on Ubuntu 22.04+ and macOS 14+ with Python 3.11+
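One way to satisfy the --no-gpu requirement, sketched with stdlib argparse; the flag name is mandated above, but everything else (parser description, how the flag is consumed) is illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fraud signal detection engine")
    # store_true: flag is False unless --no-gpu is passed, so CPU-only mode
    # is an explicit opt-in on both Linux and macOS.
    parser.add_argument("--no-gpu", action="store_true",
                        help="Force CPU-only execution (required fallback path)")
    return parser

args = build_parser().parse_args(["--no-gpu"])
print(args.no_gpu)  # True
```

Downstream code should branch once on `args.no_gpu` (and on whether a GPU is actually present) rather than scattering device checks through the pipeline.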
Performance targets:
- Linux (200GB RAM, GPU): under 30 minutes
- Linux (64GB RAM, no GPU): under 60 minutes
- MacBook (16GB RAM, no GPU): under 4 hours
DELIVERABLE
A GitHub repo or zip containing:
- README.md with setup and run instructions
- requirements.txt (pip installable)
- setup.sh — downloads data, installs deps
- run.sh — produces fraud_signals.json
- Source code (any structure)
- fraud_signals.json — sample output from an actual run on the real data
JUDGING CRITERIA (100 points, scored by AI judge)
Result Quality (70 points)
- Does fraud_signals.json contain valid JSON matching the schema above? (5 pts)
- Are excluded-provider flags present and correct? Cross-check NPIs against the LEIE — every flagged excluded provider must actually appear in the LEIE with matching dates. (10 pts)
- For each non-excluded-provider signal: is the evidence internally consistent? Do the numbers in evidence support the signal_description? Would a skeptical lawyer find the evidence convincing? (20 pts)
- Estimated overpayment amounts: are they calculated from the actual billing data (not made up)? Is the methodology explained and reproducible? (10 pts)
- fca_relevance quality: does violation_description describe what THIS specific provider did (not generic text)? Do suggested_investigation_steps reference this provider's actual NPI, address, billing codes, or dates? (15 pts)
- Signal diversity: how many distinct fraud patterns does the tool detect? Minimum 3 required for any points. More patterns with real evidence = more points. (10 pts)
Execution (20 points)
- setup.sh runs successfully on Ubuntu 22.04, Python 3.11+ (5 pts)
- setup.sh runs successfully on macOS 14+ Apple Silicon, Python 3.11+ (5 pts)
- run.sh produces output without crashing (5 pts)
- Completes in under 60 minutes on 64GB RAM Linux, no GPU (5 pts)
Novelty (10 points)
- Does the tool find fraud patterns BEYOND the 7 reference patterns listed above? Novel signals with real evidence from the data score up to the full 10 points. (10 pts)
WHAT IS NOT SCORED
- Code style, comments, architecture, test coverage — not scored
- Number of providers flagged — not scored. 50 high-quality flags beats 5,000 garbage flags.
- UI, dashboards, visualizations — not scored
- How you got there — only the quality of results matters