Competition: Medicaid Provider Fraud Signal Detection Engine — 1000N Prize
OBJECTIVE
Build a CLI tool that:
- Ingests the HHS Medicaid Provider Spending dataset (2.9GB parquet, 227M rows)
- Cross-references providers against the OIG LEIE exclusion list and the NPPES NPI registry
- Outputs a structured JSON file containing provider-level fraud signal reports usable by qui tam / FCA lawyers
REQUIRED INPUT DATA (3 files, all free public downloads)
File 1: HHS Medicaid Provider Spending
- Download: https://stopendataprod.blob.core.windows.net/datasets/medicaid-provider-spending/2026-02-09/medicaid-provider-spending.parquet (2.9 GB)
- Columns (7 total):
| Column | Type | Description |
|---|---|---|
| `BILLING_PROVIDER_NPI_NUM` | string | 10-digit NPI of billing provider |
| `SERVICING_PROVIDER_NPI_NUM` | string | 10-digit NPI of servicing provider |
| `HCPCS_CODE` | string | Procedure code |
| `CLAIM_FROM_MONTH` | date | Format: YYYY-MM-01 |
| `TOTAL_UNIQUE_BENEFICIARIES` | int | Count of unique patients |
| `TOTAL_CLAIMS` | int | Number of claims (minimum 12 per row) |
| `TOTAL_PAID` | float | USD paid by Medicaid |
- Coverage: January 2018 through December 2024
- Rows with fewer than 12 claims are excluded by HHS
File 2: OIG LEIE Exclusion List
- Download: https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv
- Columns (18 total):
| Column | Type | Max Length |
|---|---|---|
| `LASTNAME` | string | 20 |
| `FIRSTNAME` | string | 15 |
| `MIDNAME` | string | 15 |
| `BUSNAME` | string | 30 |
| `GENERAL` | string | - |
| `SPECIALTY` | string | - |
| `UPIN` | string | - |
| `NPI` | string | 10 |
| `DOB` | string | 8 |
| `ADDRESS` | string | 30 |
| `CITY` | string | 20 |
| `STATE` | string | 2 |
| `ZIP` | string | 5 |
| `EXCLTYPE` | string | 9 |
| `EXCLDATE` | string | 8 |
| `REINDATE` | string | 8 |
| `WAIVERDATE` | string | 8 |
| `WVRSTATE` | string | - |
- Note: Many excluded providers do not have NPIs. Match on NPI where available, fall back to name+state matching.
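
The NPI-first, name+state-fallback matching in the note above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation. It assumes LEIE dates are 8-character `YYYYMMDD` strings, and that missing NPIs and dates may appear as all-zero placeholders (`0000000000`, `00000000`) as well as blanks; both assumptions should be checked against the downloaded CSV. The helper names (`parse_leie_date`, `build_leie_indexes`, `match_leie`) are hypothetical.

```python
from datetime import date

MISSING_NPI = "0000000000"  # assumed placeholder for "no NPI"
MISSING_DATE = "00000000"   # assumed placeholder for "no date"

def parse_leie_date(raw):
    """Parse an assumed YYYYMMDD string; return None when blank, placeholder, or invalid."""
    raw = (raw or "").strip()
    if len(raw) != 8 or not raw.isdigit() or raw == MISSING_DATE:
        return None
    try:
        return date(int(raw[:4]), int(raw[4:6]), int(raw[6:8]))
    except ValueError:
        return None

def clean_npi(raw):
    """Return a usable 10-digit NPI string, or None for null/blank/placeholder values."""
    raw = (raw or "").strip()
    if len(raw) == 10 and raw.isdigit() and raw != MISSING_NPI:
        return raw
    return None

def name_state_key(last, first, state):
    """Normalized (LAST, FIRST, STATE) key; None if any part is missing."""
    parts = tuple((p or "").strip().upper() for p in (last, first, state))
    return parts if all(parts) else None

def build_leie_indexes(leie_rows):
    """Index LEIE rows by NPI; rows without a usable NPI go into the name+state index."""
    by_npi, by_name = {}, {}
    for row in leie_rows:
        npi = clean_npi(row.get("NPI"))
        if npi:
            by_npi.setdefault(npi, []).append(row)
            continue
        key = name_state_key(row.get("LASTNAME"), row.get("FIRSTNAME"), row.get("STATE"))
        if key:
            by_name.setdefault(key, []).append(row)
    return by_npi, by_name

def match_leie(npi, last, first, state, by_npi, by_name):
    """Return (matching LEIE rows, match method) for one provider."""
    npi = clean_npi(npi)
    if npi and npi in by_npi:
        return by_npi[npi], "npi"
    key = name_state_key(last, first, state)
    if key and key in by_name:
        return by_name[key], "name_state"
    return [], None
```

Name+state matches are much weaker evidence than NPI matches, so a tool might record the match method in the signal's evidence so reviewers can weigh it.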
File 3: NPPES NPI Registry
- Download: https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2026_V2.zip (~1 GB zipped)
- Required columns (from 329 available):
| Column | Purpose |
|---|---|
| `NPI` | Join key |
| `Entity Type Code` | 1=Individual, 2=Organization |
| `Provider Organization Name (Legal Business Name)` | Entity name |
| `Provider Last Name (Legal Name)` | Individual name |
| `Provider First Name` | Individual name |
| `Provider Business Practice Location Address State Name` | State |
| `Provider Business Practice Location Address Postal Code` | ZIP |
| `Healthcare Provider Taxonomy Code_1` | Provider specialty |
| `Provider Enumeration Date` | When NPI was first issued |
| `Authorized Official Last Name` | Person controlling the entity |
| `Authorized Official First Name` | Person controlling the entity |
REQUIRED FRAUD SIGNALS (6 total, each must be implemented)
Each signal has a precise definition. The tool must flag a provider if ANY of these conditions is true:
Signal 1: Excluded Provider Still Billing
- Definition: A `BILLING_PROVIDER_NPI_NUM` or `SERVICING_PROVIDER_NPI_NUM` in the spending data matches an `NPI` in the LEIE where `EXCLDATE` is before the `CLAIM_FROM_MONTH` and `REINDATE` is empty or after `CLAIM_FROM_MONTH`
- Output: The NPI, exclusion date, exclusion type, and total dollars paid after exclusion date
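
A literal reading of the Signal 1 date condition might look like the sketch below. It assumes the LEIE dates have already been parsed into `datetime.date` objects (with `None` for an empty `REINDATE`) and that spending rows have been reduced to `(claim_month, total_paid)` pairs; the function names are hypothetical.

```python
from datetime import date

def billed_while_excluded(claim_month, excl_date, rein_date=None):
    """True when the exclusion predates the claim month and no reinstatement precedes it."""
    if excl_date is None:
        return False
    not_reinstated = rein_date is None or rein_date > claim_month
    return excl_date < claim_month and not_reinstated

def paid_after_exclusion(monthly_paid, excl_date, rein_date=None):
    """Sum TOTAL_PAID over months that satisfy the exclusion condition.

    monthly_paid: iterable of (claim_month, total_paid) pairs.
    """
    return sum(
        paid for month, paid in monthly_paid
        if billed_while_excluded(month, excl_date, rein_date)
    )
```

Because `CLAIM_FROM_MONTH` is always the first of the month, a provider excluded mid-month is not flagged for that same month under this reading; the first flagged month is the following one.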
Signal 2: Billing Volume Outlier
- Definition: For each `BILLING_PROVIDER_NPI_NUM`, compute total `TOTAL_PAID` across all months. Join to NPPES to get `Healthcare Provider Taxonomy Code_1`. Group all providers by taxonomy code and state. Flag any provider whose total paid is above the 99th percentile of their taxonomy+state peer group
- Output: The NPI, their total paid, the peer group median, the peer group 99th percentile, and the ratio (provider_total / peer_median)
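
The peer-group percentile logic for Signal 2 could be sketched as below. The definition does not say which percentile method to use, so this sketch assumes linear interpolation between order statistics; a different convention (for example nearest-rank) would shift the threshold slightly. The input shapes and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import median

def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def billing_outliers(total_paid_by_npi, peer_key_by_npi):
    """Flag providers above the 99th percentile of their (taxonomy, state) group.

    total_paid_by_npi: {npi: total TOTAL_PAID across all months}
    peer_key_by_npi:   {npi: (taxonomy_code, state)} taken from NPPES
    """
    groups = defaultdict(list)
    for npi, total in total_paid_by_npi.items():
        key = peer_key_by_npi.get(npi)
        if key is not None:
            groups[key].append(total)

    # Compute each group's median and 99th percentile once.
    stats = {key: (median(vals), percentile(vals, 99)) for key, vals in groups.items()}

    flagged = []
    for npi, total in total_paid_by_npi.items():
        key = peer_key_by_npi.get(npi)
        if key is None:
            continue  # provider not found in NPPES; no peer group
        peer_median, peer_p99 = stats[key]
        if total > peer_p99:
            flagged.append({
                "npi": npi,
                "total_paid": total,
                "peer_median": peer_median,
                "peer_p99": peer_p99,
                "ratio": total / peer_median if peer_median else None,
            })
    return flagged
```

With this method a provider alone in its peer group is never flagged, because its total equals the group's 99th percentile.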
Signal 3: Rapid Billing Escalation (New Entity)
- Definition: Join spending data to NPPES on NPI. Get `Provider Enumeration Date`. For providers enumerated within 24 months before their first `CLAIM_FROM_MONTH` in the spending data: compute month-over-month `TOTAL_PAID` growth rate for their first 12 months of billing. Flag if any rolling 3-month average growth rate exceeds 200%
- Output: The NPI, enumeration date, first billing month, monthly paid amounts for first 12 months, the peak 3-month growth rate
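
One way to read the Signal 3 growth math is sketched below. It assumes the input is a list of monthly `TOTAL_PAID` sums ordered from the first billing month, that growth is `(current - previous) / previous * 100`, and that windows containing an undefined rate (previous month paid 0) are skipped. The definition does not address gaps between billed months, so that handling is left out. All function names are hypothetical.

```python
from datetime import date

def is_new_entity(enumeration_date, first_claim_month, max_months=24):
    """True if the NPI was enumerated within max_months before the first claim month."""
    gap = ((first_claim_month.year - enumeration_date.year) * 12
           + (first_claim_month.month - enumeration_date.month))
    return 0 <= gap <= max_months

def month_over_month_growth(monthly_paid):
    """Percent growth for each consecutive pair; None where the prior month is not positive."""
    rates = []
    for prev, cur in zip(monthly_paid, monthly_paid[1:]):
        rates.append((cur - prev) / prev * 100.0 if prev > 0 else None)
    return rates

def peak_rolling_growth(monthly_paid, window=3, months=12):
    """Largest rolling average of growth rates over the first `months` of billing."""
    rates = month_over_month_growth(monthly_paid[:months])
    best = None
    for i in range(len(rates) - window + 1):
        chunk = rates[i:i + window]
        if any(r is None for r in chunk):
            continue  # skip windows with an undefined growth rate
        avg = sum(chunk) / window
        if best is None or avg > best:
            best = avg
    return best

def is_rapid_escalation(monthly_paid, threshold_pct=200.0):
    """Flag when any rolling 3-month average growth rate exceeds the threshold."""
    peak = peak_rolling_growth(monthly_paid)
    return peak is not None and peak > threshold_pct
```

For example, a provider paid 100, 400, 1600, 6400 in its first four months has three consecutive 300% growth rates, so its rolling average is 300% and it is flagged.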
Signal 4: Workforce Impossibility
- Definition: For each `BILLING_PROVIDER_NPI_NUM` where NPPES `Entity Type Code` = 2 (organization), compute max `TOTAL_CLAIMS` in any single month. Divide by 22 (working days) then by 8 (hours). If the implied claims-per-hour exceeds 6 (i.e., one claim every 10 minutes sustained for every working hour of every working day), flag the provider
- Output: The NPI, the peak month, the peak claims count, the implied claims-per-provider-hour, the total paid in that month
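
The Signal 4 arithmetic reduces to a single threshold: 6 claims/hour x 8 hours x 22 days = 1,056 claims in a month. A small sketch, assuming monthly claim counts have already been summed per provider into a `{month: claims}` mapping (the names here are hypothetical):

```python
WORKING_DAYS_PER_MONTH = 22
HOURS_PER_DAY = 8
MAX_CLAIMS_PER_HOUR = 6

def implied_claims_per_hour(month_claims):
    """Claims per working hour implied by one month's claim count."""
    return month_claims / WORKING_DAYS_PER_MONTH / HOURS_PER_DAY

def workforce_impossibility(claims_by_month):
    """Return (peak_month, peak_claims, claims_per_hour) if flagged, else None."""
    if not claims_by_month:
        return None
    peak_month = max(claims_by_month, key=claims_by_month.get)
    peak_claims = claims_by_month[peak_month]
    rate = implied_claims_per_hour(peak_claims)
    if rate > MAX_CLAIMS_PER_HOUR:
        return peak_month, peak_claims, rate
    return None
```

A peak month of exactly 1,056 claims implies 6.0 claims per hour and is not flagged, because the rule says "exceeds 6"; 1,057 claims is the first flagged value.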
Signal 5: Shared Authorized Official Across Multiple NPIs
- Definition: From NPPES, group all NPIs by (`Authorized Official Last Name` + `Authorized Official First Name`). For any authorized official controlling 5 or more NPIs: sum `TOTAL_PAID` across all those NPIs from the spending data. Flag if the combined total exceeds $1,000,000
- Output: The authorized official name, list of all NPIs they control, total paid per NPI, combined total paid
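
The Signal 5 grouping could be sketched as follows. It assumes NPPES rows are available as dictionaries keyed by the column names listed earlier, and that names are compared after trimming and upper-casing; that normalization is an assumption, since the definition only says to group by last name plus first name. The function name is hypothetical.

```python
from collections import defaultdict

def shared_official_flags(nppes_rows, total_paid_by_npi,
                          min_npis=5, min_combined=1_000_000):
    """Flag authorized officials controlling >= min_npis NPIs with combined paid > min_combined."""
    npis_by_official = defaultdict(set)
    for row in nppes_rows:
        last = (row.get("Authorized Official Last Name") or "").strip().upper()
        first = (row.get("Authorized Official First Name") or "").strip().upper()
        if last and first:
            npis_by_official[(last, first)].add(row["NPI"])

    flags = []
    for (last, first), npis in npis_by_official.items():
        if len(npis) < min_npis:
            continue
        # NPIs absent from the spending data contribute 0.
        per_npi = {npi: total_paid_by_npi.get(npi, 0.0) for npi in sorted(npis)}
        combined = sum(per_npi.values())
        if combined > min_combined:
            flags.append({
                "official_name": f"{first} {last}",
                "npis": sorted(npis),
                "total_paid_per_npi": per_npi,
                "combined_total_paid": combined,
            })
    return flags
```

Grouping on name alone will merge unrelated people who share a common name, so the resulting flags are leads to verify rather than confirmed common control.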
Signal 6: Geographic Implausibility
- Definition: For each `BILLING_PROVIDER_NPI_NUM`, get their state from NPPES (`Provider Business Practice Location Address State Name`). Check if they have `TOTAL_CLAIMS` > 100 in any single month for HCPCS codes in the home health range (G0151-G0162, G0299-G0300, S9122-S9124, T1019-T1022). If so, compute their `TOTAL_UNIQUE_BENEFICIARIES` / `TOTAL_CLAIMS` ratio for those codes. Flag if ratio is below 0.1 (i.e., fewer than 1 unique patient per 10 claims -- indicating repeated billing on same patients)
- Output: The NPI, state, the flagged HCPCS codes, the month, claims count, unique beneficiaries, the ratio
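
The home health code ranges and the ratio test for Signal 6 might be implemented along these lines. The definition leaves open whether the ratio is computed per spending row or aggregated across codes; this sketch assumes it is evaluated per row (one provider, one HCPCS code, one month). Rows are assumed to be dictionaries keyed by the spending dataset's column names, and the function names are hypothetical.

```python
HOME_HEALTH_RANGES = [
    ("G0151", "G0162"),
    ("G0299", "G0300"),
    ("S9122", "S9124"),
    ("T1019", "T1022"),
]

def expand_range(start, end):
    """Expand a same-letter HCPCS range such as G0151-G0162 into individual codes."""
    prefix = start[0]
    return {f"{prefix}{n:04d}" for n in range(int(start[1:]), int(end[1:]) + 1)}

HOME_HEALTH_CODES = set().union(*(expand_range(a, b) for a, b in HOME_HEALTH_RANGES))

def geographic_flags(rows, min_claims=100, max_ratio=0.1):
    """Return evidence for rows with > min_claims claims and a beneficiary/claim ratio < max_ratio."""
    flags = []
    for row in rows:
        if row["HCPCS_CODE"] not in HOME_HEALTH_CODES:
            continue
        claims = row["TOTAL_CLAIMS"]
        if claims <= min_claims:
            continue
        ratio = row["TOTAL_UNIQUE_BENEFICIARIES"] / claims
        if ratio < max_ratio:
            flags.append({
                "npi": row["BILLING_PROVIDER_NPI_NUM"],
                "hcpcs_code": row["HCPCS_CODE"],
                "month": row["CLAIM_FROM_MONTH"],
                "claims": claims,
                "unique_beneficiaries": row["TOTAL_UNIQUE_BENEFICIARIES"],
                "ratio": ratio,
            })
    return flags
```

The four ranges expand to 21 distinct codes. The provider's state still has to be joined in from NPPES before the evidence is written out.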
REQUIRED OUTPUT FORMAT
The tool must produce a single file: fraud_signals.json
{
  "generated_at": "2026-02-18T12:00:00Z",
  "tool_version": "string",
  "total_providers_scanned": 0,
  "total_providers_flagged": 0,
  "signal_counts": {
    "excluded_provider": 0,
    "billing_outlier": 0,
    "rapid_escalation": 0,
    "workforce_impossibility": 0,
    "shared_official": 0,
    "geographic_implausibility": 0
  },
  "flagged_providers": [
    {
      "npi": "1234567890",
      "provider_name": "string (from NPPES)",
      "entity_type": "individual|organization",
      "taxonomy_code": "string",
      "state": "XX",
      "enumeration_date": "YYYY-MM-DD",
      "total_paid_all_time": 0.00,
      "total_claims_all_time": 0,
      "total_unique_beneficiaries_all_time": 0,
      "signals": [
        {
          "signal_type": "excluded_provider|billing_outlier|rapid_escalation|workforce_impossibility|shared_official|geographic_implausibility",
          "severity": "critical|high|medium",
          "evidence": {}
        }
      ],
      "estimated_overpayment_usd": 0.00,
      "fca_relevance": {
        "claim_type": "string describing the FCA violation pattern",
        "statute_reference": "31 U.S.C. section 3729(a)(1)(A|B|C|G)",
        "suggested_next_steps": ["string"]
      }
    }
  ]
}
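
Assembling the top-level report from a list of already-built provider records could look like the sketch below. It assumes `signal_counts` counts signal entries across all flagged providers, which the schema does not state explicitly; counting distinct providers per signal type is an equally plausible reading. `build_report` and `write_report` are hypothetical helper names.

```python
import json
from collections import Counter
from datetime import datetime, timezone

SIGNAL_TYPES = [
    "excluded_provider",
    "billing_outlier",
    "rapid_escalation",
    "workforce_impossibility",
    "shared_official",
    "geographic_implausibility",
]

def build_report(flagged_providers, total_scanned, tool_version):
    """Wrap provider records in the top-level structure of fraud_signals.json."""
    counts = Counter(
        signal["signal_type"]
        for provider in flagged_providers
        for signal in provider["signals"]
    )
    return {
        "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tool_version": tool_version,
        "total_providers_scanned": total_scanned,
        "total_providers_flagged": len(flagged_providers),
        # Every signal type appears, even with a zero count.
        "signal_counts": {name: counts.get(name, 0) for name in SIGNAL_TYPES},
        "flagged_providers": flagged_providers,
    }

def write_report(report, path="fraud_signals.json"):
    """Serialize the report to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```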
Severity rules (no discretion):
- critical: Signal 1 (excluded provider) always critical
- high: Signal 2 with ratio > 5x peer median, Signal 3 with growth > 500%, Signal 4, Signal 5 with combined > $5M
- medium: All other flags
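
Because the severity rules leave no discretion, they map directly onto a small lookup function. The sketch below assumes the evidence dictionary carries the numeric values under the key names `ratio_to_peer_median`, `peak_3_month_growth_pct`, and `combined_total_paid`; those key names are hypothetical, since the schema leaves `evidence` open.

```python
def assign_severity(signal_type, evidence):
    """Apply the fixed severity rules to one signal."""
    if signal_type == "excluded_provider":
        return "critical"
    if signal_type == "workforce_impossibility":
        return "high"
    if signal_type == "billing_outlier":
        if (evidence.get("ratio_to_peer_median") or 0) > 5:
            return "high"
    elif signal_type == "rapid_escalation":
        if (evidence.get("peak_3_month_growth_pct") or 0) > 500:
            return "high"
    elif signal_type == "shared_official":
        if (evidence.get("combined_total_paid") or 0) > 5_000_000:
            return "high"
    # Everything else, including all Signal 6 flags, is medium.
    return "medium"
```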
estimated_overpayment_usd calculation:
- Signal 1: total paid after exclusion date
- Signal 2: (provider_total - peer_99th_percentile), floored at 0
- Signal 3: total paid in months where growth exceeded 200%
- Signal 4: (peak_month_claims - (6 * 8 * 22)) * (peak_month_total_paid / peak_month_claims), floored at 0
- Signal 5: not estimated (set to 0)
- Signal 6: not estimated (set to 0)
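
The overpayment formulas for Signals 2, 3, and 4 can be written out as below. For Signal 3 the list does not say whether "growth" means the month-over-month rate or the rolling average; this sketch assumes the month-over-month rate into each month. Function names are hypothetical.

```python
SIGNAL4_CLAIM_CAPACITY = 6 * 8 * 22  # 1,056 claims per month

def overpayment_signal2(provider_total, peer_p99):
    """Amount above the peer group's 99th percentile, floored at 0."""
    return max(provider_total - peer_p99, 0.0)

def overpayment_signal3(monthly_paid, threshold_pct=200.0):
    """Total paid in months whose month-over-month growth exceeded the threshold."""
    total = 0.0
    for prev, cur in zip(monthly_paid, monthly_paid[1:]):
        if prev > 0 and (cur - prev) / prev * 100.0 > threshold_pct:
            total += cur
    return total

def overpayment_signal4(peak_month_claims, peak_month_total_paid):
    """Excess claims over capacity, priced at the peak month's average paid per claim."""
    if peak_month_claims <= 0:
        return 0.0
    avg_paid_per_claim = peak_month_total_paid / peak_month_claims
    excess_claims = peak_month_claims - SIGNAL4_CLAIM_CAPACITY
    return max(excess_claims * avg_paid_per_claim, 0.0)
```

As a worked case for Signal 4: a peak month of 2,112 claims and $211,200 paid averages $100 per claim, and the 1,056 claims above capacity give an estimate of $105,600.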
statute_reference mapping:
- Signal 1: `31 U.S.C. section 3729(a)(1)(A)` (presenting false claims -- excluded provider cannot bill)
- Signal 2: `31 U.S.C. section 3729(a)(1)(A)` (potential overbilling)
- Signal 3: `31 U.S.C. section 3729(a)(1)(A)` (potential bust-out scheme)
- Signal 4: `31 U.S.C. section 3729(a)(1)(B)` (false records -- impossible volume implies fabricated claims)
- Signal 5: `31 U.S.C. section 3729(a)(1)(C)` (conspiracy -- coordinated billing through shell entities)
- Signal 6: `31 U.S.C. section 3729(a)(1)(G)` (reverse false claims -- repeated billing on same patients)
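
The mapping above is a fixed table, so a plain dictionary keyed by `signal_type` is probably enough. A sketch (the names `STATUTE_BY_SIGNAL` and `statute_reference` are hypothetical):

```python
STATUTE_BY_SIGNAL = {
    "excluded_provider": "31 U.S.C. section 3729(a)(1)(A)",
    "billing_outlier": "31 U.S.C. section 3729(a)(1)(A)",
    "rapid_escalation": "31 U.S.C. section 3729(a)(1)(A)",
    "workforce_impossibility": "31 U.S.C. section 3729(a)(1)(B)",
    "shared_official": "31 U.S.C. section 3729(a)(1)(C)",
    "geographic_implausibility": "31 U.S.C. section 3729(a)(1)(G)",
}

def statute_reference(signal_type):
    """Look up the statute string for a signal; unknown types raise KeyError."""
    return STATUTE_BY_SIGNAL[signal_type]
```

The output schema carries one `fca_relevance` block per provider, and the mapping does not say which statute to report when a provider triggers several signals, so that choice is left to the implementer.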
HARDWARE REQUIREMENTS
Primary target: Single Linux machine with <=200GB RAM, <=1 GPU (H100/H200 class)
Required fallback: Must also run on a MacBook (Apple Silicon, 16GB+ RAM) with no GPU. Performance will be slower but the tool must complete without error.
Constraints:
- No distributed computing frameworks required (no Spark cluster, no multi-node Dask)
- No proprietary cloud services required (no BigQuery, no Snowflake, no Databricks)
- GPU is optional -- if the solution uses GPU (e.g., RAPIDS/cuDF), it must provide a CPU-only fallback via a `--no-gpu` flag that works on both Linux and macOS
- All dependencies pip-installable from public repositories (no conda-only packages)
- Must work on both Ubuntu 22.04+ and macOS 14+ with Python 3.11+
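
A command-line entry point that honors the `--no-gpu` constraint might be wired up as in this sketch. Only `--no-gpu` comes from the requirements; the `--data-dir` and `--output` options, their defaults, and the `select_backend` helper are hypothetical. The GPU check only tests whether a `cudf` module can be found, without importing it.

```python
import argparse
import importlib.util

def parse_args(argv=None):
    """Parse CLI options; all paths are configurable rather than hardcoded."""
    parser = argparse.ArgumentParser(description="Medicaid fraud signal detection")
    parser.add_argument("--no-gpu", action="store_true",
                        help="force the CPU-only code path")
    parser.add_argument("--data-dir", default="data",
                        help="directory containing the three input files (hypothetical option)")
    parser.add_argument("--output", default="fraud_signals.json",
                        help="where to write the report (hypothetical option)")
    return parser.parse_args(argv)

def select_backend(no_gpu):
    """Return 'gpu' only if allowed and cuDF appears to be installed, else 'cpu'."""
    if no_gpu:
        return "cpu"
    return "gpu" if importlib.util.find_spec("cudf") is not None else "cpu"

if __name__ == "__main__":
    args = parse_args()
    print(f"backend={select_backend(args.no_gpu)} output={args.output}")
```

Falling back to CPU automatically when cuDF is absent means the same `run.sh` can work unchanged on a MacBook, with `--no-gpu` available to force the CPU path on a GPU machine.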
Performance criteria:
- Linux (200GB RAM, GPU available): completes full run in under 30 minutes
- Linux (64GB RAM, no GPU): completes full run in under 60 minutes
- MacBook (16GB RAM, Apple Silicon, no GPU): completes full run in under 4 hours
REQUIRED DELIVERABLE STRUCTURE
submission/
  README.md              -- Setup + run instructions
  requirements.txt       -- Python dependencies (pip installable)
  setup.sh               -- One command: downloads data, installs deps
  run.sh                 -- One command: produces fraud_signals.json
  src/
    ingest.py            -- Data loading + joining
    signals.py           -- All 6 signal implementations
    output.py            -- JSON report generation
  tests/
    test_signals.py      -- Unit tests for each signal with synthetic data
    fixtures/            -- Small synthetic test datasets
  fraud_signals.json     -- Sample output from actual data run
JUDGING CRITERIA (scored by AI judge, all binary or numeric -- 100 points total)
Functional (60 points)
- setup.sh completes without error on Ubuntu 22.04 with Python 3.11+ (5 pts)
- setup.sh completes without error on macOS 14+ Apple Silicon with Python 3.11+ (5 pts)
- run.sh produces fraud_signals.json that validates against the schema above (10 pts)
- Signal 1 implemented and produces >=1 result (verified against LEIE data) (10 pts)
- Signal 2 implemented and produces results with correct percentile math (5 pts)
- Signal 3 implemented and produces results with correct growth rate math (5 pts)
- Signal 4 implemented and produces results with correct threshold math (5 pts)
- Signal 5 implemented and produces results from NPPES authorized official data (5 pts)
- Signal 6 implemented and produces results for home health HCPCS codes (5 pts)
- All estimated_overpayment_usd values follow the formulas above (5 pts)
Testing (15 points)
- pytest tests/ passes with >=6 tests (one per signal) (10 pts)
- Test fixtures contain synthetic data that triggers each signal (5 pts)
Legal Usability (15 points)
- Every flagged provider has all required JSON fields populated (non-null) (5 pts)
- statute_reference correctly maps per the table above (5 pts)
- suggested_next_steps contains >=2 specific steps per flag (e.g., "Verify provider at [address] is operational", "Request claims detail for HCPCS [code] for [month]") (5 pts)
Code Quality (10 points)
- No hardcoded file paths (all paths relative or configurable) (3 pts)
- Handles missing/null NPI values in LEIE without crashing (3 pts)
- Completes full run in under 60 minutes with 64GB RAM, no GPU (4 pts)
WHAT IS NOT SCORED (to prevent ambiguity):
- Code style, comments, docstrings -- not scored
- Number of providers flagged -- not scored (quality of signals matters, not volume)
- UI, visualization, dashboards -- not scored
- Support for formats other than Parquet -- not scored