Competition: Medicaid Provider Fraud Signal Detection Engine

OBJECTIVE

Build a CLI tool that:

Ingests the HHS Medicaid Provider Spending dataset (2.9GB parquet, 227M rows)
Cross-references providers against the OIG LEIE exclusion list and the NPPES NPI registry
Outputs a structured JSON file containing provider-level fraud signal reports usable by qui tam / FCA lawyers

REQUIRED INPUT DATA (3 files, all free public downloads)

File 1: HHS Medicaid Provider Spending

Download: https://stopendataprod.blob.core.windows.net/datasets/medicaid-provider-spending/2026-02-09/medicaid-provider-spending.parquet (2.9 GB)
Columns (7 total):

Column	Type	Description
`BILLING_PROVIDER_NPI_NUM`	string	10-digit NPI of billing provider
`SERVICING_PROVIDER_NPI_NUM`	string	10-digit NPI of servicing provider
`HCPCS_CODE`	string	Procedure code
`CLAIM_FROM_MONTH`	date	Format: YYYY-MM-01
`TOTAL_UNIQUE_BENEFICIARIES`	int	Count of unique patients
`TOTAL_CLAIMS`	int	Number of claims (minimum 12 per row)
`TOTAL_PAID`	float	USD paid by Medicaid

Coverage: January 2018 through December 2024
Rows with fewer than 12 claims are excluded by HHS

File 2: OIG LEIE Exclusion List

Download: https://oig.hhs.gov/exclusions/downloadables/UPDATED.csv
Columns (18 total):

Column	Type	Max Length
`LASTNAME`	string	20
`FIRSTNAME`	string	15
`MIDNAME`	string	15
`BUSNAME`	string	30
`GENERAL`	string	-
`SPECIALTY`	string	-
`UPIN`	string	-
`NPI`	string	10
`DOB`	string	8
`ADDRESS`	string	30
`CITY`	string	20
`STATE`	string	2
`ZIP`	string	5
`EXCLTYPE`	string	9
`EXCLDATE`	string	8
`REINDATE`	string	8
`WAIVERDATE`	string	8
`WVRSTATE`	string	-

Note: Many excluded providers do not have NPIs. Match on NPI where available; otherwise fall back to name+state matching.
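The matching rule in the note above can be sketched as follows. This is a minimal illustration using only the standard library, not a prescribed implementation. The helper names (`normalize_npi`, `name_state_key`, `index_leie`) are hypothetical, and treating blank or all-zero NPI values as "missing" is an assumption about how the LEIE file encodes absent NPIs. Only individual names are keyed here; business entities would need a similar key built on `BUSNAME`.

```python
import csv
import io


def normalize_npi(raw):
    """Return a 10-digit NPI string, or None when the field is unusable."""
    value = (raw or "").strip()
    # Assumption: blank or all-zero values mean "no NPI on file".
    if len(value) != 10 or not value.isdigit() or value == "0000000000":
        return None
    return value


def name_state_key(row):
    """Fallback join key: upper-cased last name, first name, and state."""
    parts = [(row.get(col) or "").strip().upper()
             for col in ("LASTNAME", "FIRSTNAME", "STATE")]
    return tuple(parts) if all(parts) else None


def index_leie(csv_text):
    """Split LEIE rows into an NPI index and a name+state fallback index."""
    by_npi, by_name_state = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        npi = normalize_npi(row.get("NPI"))
        if npi is not None:
            by_npi.setdefault(npi, []).append(row)
            continue
        key = name_state_key(row)
        if key is not None:
            by_name_state.setdefault(key, []).append(row)
    return by_npi, by_name_state
```

Rows with neither a usable NPI nor a complete name+state key are skipped rather than raising, which also covers the "handles missing/null NPI values" requirement.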

File 3: NPPES NPI Registry

Download: https://download.cms.gov/nppes/NPPES_Data_Dissemination_February_2026_V2.zip (~1 GB zipped)
Required columns (from 329 available):

Column	Purpose
`NPI`	Join key
`Entity Type Code`	1=Individual, 2=Organization
`Provider Organization Name (Legal Business Name)`	Entity name
`Provider Last Name (Legal Name)`	Individual name
`Provider First Name`	Individual name
`Provider Business Practice Location Address State Name`	State
`Provider Business Practice Location Address Postal Code`	ZIP
`Healthcare Provider Taxonomy Code_1`	Provider specialty
`Provider Enumeration Date`	When NPI was first issued
`Authorized Official Last Name`	Person controlling the entity
`Authorized Official First Name`	Person controlling the entity
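Because only 11 of the 329 NPPES columns are needed, one way to keep memory low on the 16GB fallback machine is to stream the CSV and keep just those columns. The sketch below uses the standard `csv` module; the function name `read_nppes_subset` is hypothetical, and a columnar engine could do the same projection more quickly.

```python
import csv

REQUIRED_NPPES_COLUMNS = [
    "NPI",
    "Entity Type Code",
    "Provider Organization Name (Legal Business Name)",
    "Provider Last Name (Legal Name)",
    "Provider First Name",
    "Provider Business Practice Location Address State Name",
    "Provider Business Practice Location Address Postal Code",
    "Healthcare Provider Taxonomy Code_1",
    "Provider Enumeration Date",
    "Authorized Official Last Name",
    "Authorized Official First Name",
]


def read_nppes_subset(handle):
    """Yield one dict per NPPES row, restricted to the required columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_NPPES_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"NPPES file is missing columns: {missing}")
    for row in reader:
        yield {col: row[col] for col in REQUIRED_NPPES_COLUMNS}
```

The generator accepts any open text handle, so it works the same on an extracted file or on a member opened from the zip archive.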

REQUIRED FRAUD SIGNALS (6 total, each must be implemented)

Each signal has a precise definition. The tool must flag a provider if ANY of these conditions is true:

Signal 1: Excluded Provider Still Billing

Definition: A BILLING_PROVIDER_NPI_NUM or SERVICING_PROVIDER_NPI_NUM in the spending data matches an NPI in the LEIE, where EXCLDATE is before CLAIM_FROM_MONTH and REINDATE is either empty or after CLAIM_FROM_MONTH
Output: The NPI, exclusion date, exclusion type, and total dollars paid after exclusion date
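The date logic for Signal 1 can be illustrated with a short sketch. It assumes LEIE dates are 8-character `YYYYMMDD` strings (consistent with the max length of 8 listed for `EXCLDATE` and `REINDATE`) and that a blank or all-zero value means "no date"; both are assumptions to verify against the actual file. The function names are hypothetical.

```python
from datetime import date, datetime


def parse_leie_date(raw):
    """Parse a YYYYMMDD string; blank or all-zero values are treated as no date."""
    value = (raw or "").strip()
    if not value or set(value) == {"0"}:
        return None
    return datetime.strptime(value, "%Y%m%d").date()


def is_excluded_during(claim_month, excldate, reindate):
    """True when the exclusion predates claim_month and no reinstatement has taken effect."""
    excluded_on = parse_leie_date(excldate)
    if excluded_on is None or excluded_on >= claim_month:
        return False
    reinstated_on = parse_leie_date(reindate)
    return reinstated_on is None or reinstated_on > claim_month


def paid_after_exclusion(rows, excldate, reindate):
    """Sum TOTAL_PAID over (claim_month, total_paid) rows billed while excluded."""
    return sum(paid for month, paid in rows
               if is_excluded_during(month, excldate, reindate))
```

For example, with an exclusion dated `20200115` and no reinstatement, a row for `date(2020, 2, 1)` counts toward the total while a row for `date(2019, 12, 1)` does not.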

Signal 2: Billing Volume Outlier

Definition: For each BILLING_PROVIDER_NPI_NUM, compute total TOTAL_PAID across all months. Join to NPPES to get Healthcare Provider Taxonomy Code_1 and the provider's state (Provider Business Practice Location Address State Name). Group all providers by taxonomy code and state. Flag any provider whose total paid is above the 99th percentile of their taxonomy+state peer group
Output: The NPI, their total paid, the peer group median, the peer group 99th percentile, and the ratio (provider_total / peer_median)
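A compact sketch of the peer-group math for Signal 2 follows. The definition does not say which percentile interpolation to use, so linear interpolation between the two nearest ranks is an assumption here, as are the function names and the output key names.

```python
from collections import defaultdict
from statistics import median


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted, non-empty list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def billing_outliers(providers):
    """providers: iterable of (npi, taxonomy_code, state, total_paid) tuples."""
    groups = defaultdict(list)
    for npi, taxonomy, state, total in providers:
        groups[(taxonomy, state)].append((npi, total))
    flagged = []
    for members in groups.values():
        totals = sorted(total for _, total in members)
        p99, med = percentile(totals, 99), median(totals)
        for npi, total in members:
            if total > p99:
                flagged.append({
                    "npi": npi,
                    "total_paid": total,
                    "peer_median": med,
                    "peer_99th_percentile": p99,
                    # Ratio is undefined when the peer median is zero.
                    "ratio": total / med if med else None,
                })
    return flagged
```

Note that a peer group with a single member can never be flagged under this reading, because its only value equals its own 99th percentile.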

Signal 3: Rapid Billing Escalation (New Entity)

Definition: Join spending data to NPPES on NPI. Get Provider Enumeration Date. For providers enumerated within 24 months before their first CLAIM_FROM_MONTH in the spending data: compute month-over-month TOTAL_PAID growth rate for their first 12 months of billing. Flag if any rolling 3-month average growth rate exceeds 200%
Output: The NPI, enumeration date, first billing month, monthly paid amounts for first 12 months, the peak 3-month growth rate
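One possible reading of the Signal 3 arithmetic is sketched below: month-over-month growth is `(current - previous) / previous * 100`, and the rolling figure is the mean of three consecutive growth rates. That interpretation, the whole-calendar-month gap calculation, and the function names are assumptions; the input list is taken to hold positive amounts for consecutive billing months.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from earlier to later (negative if later comes first)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def is_new_entity(enumeration_date, first_claim_month):
    """True when the NPI was enumerated within 24 months before first billing."""
    return 0 <= months_between(enumeration_date, first_claim_month) <= 24


def peak_rolling_growth(monthly_paid, window=3):
    """Peak rolling-average month-over-month growth (percent) in the first 12 months."""
    first_year = monthly_paid[:12]
    growth = [
        (curr - prev) / prev * 100
        for prev, curr in zip(first_year, first_year[1:])
        if prev > 0  # pairs with a non-positive base month are skipped
    ]
    windows = [sum(growth[i:i + window]) / window
               for i in range(len(growth) - window + 1)]
    return max(windows) if windows else None
```

With monthly payments of 1000, 2000, 8000, 40000 and 44000, the growth rates are 100%, 300%, 400% and 10%, so the peak 3-month average is about 266.7% and the provider would be flagged.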

Signal 4: Workforce Impossibility

Definition: For each BILLING_PROVIDER_NPI_NUM where NPPES Entity Type Code = 2 (organization), compute max TOTAL_CLAIMS in any single month. Divide by 22 (working days) then by 8 (hours). If the implied claims-per-hour exceeds 6 (i.e., one claim every 10 minutes sustained for every working hour of every working day), flag the provider
Output: The NPI, the peak month, the peak claims count, the implied claims-per-hour, the total paid in that month
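The Signal 4 threshold works out to 6 * 8 * 22 = 1,056 claims in a month. A minimal sketch of the check follows; it assumes the caller has already summed TOTAL_CLAIMS and TOTAL_PAID per month for one organization NPI, and the function and key names are hypothetical.

```python
WORKING_DAYS = 22
HOURS_PER_DAY = 8
MAX_CLAIMS_PER_HOUR = 6


def workforce_signal(monthly):
    """monthly: dict mapping month -> (total_claims, total_paid) for one organization NPI."""
    peak_month = max(monthly, key=lambda month: monthly[month][0])
    peak_claims, peak_paid = monthly[peak_month]
    claims_per_hour = peak_claims / WORKING_DAYS / HOURS_PER_DAY
    if claims_per_hour <= MAX_CLAIMS_PER_HOUR:
        return None  # at or below the threshold: no flag
    return {
        "peak_month": peak_month,
        "peak_claims": peak_claims,
        "implied_claims_per_hour": claims_per_hour,
        "peak_month_total_paid": peak_paid,
    }
```

A peak month of 1,760 claims implies 10 claims per hour and is flagged; exactly 1,056 claims implies 6 per hour and is not, since the rule says "exceeds".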

Signal 5: Shared Authorized Official Across Multiple NPIs

Definition: From NPPES, group all NPIs by (Authorized Official Last Name + Authorized Official First Name). For any authorized official controlling 5 or more NPIs: sum TOTAL_PAID across all those NPIs from the spending data. Flag if the combined total exceeds $1,000,000
Output: The authorized official name, list of all NPIs they control, total paid per NPI, combined total paid
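The grouping for Signal 5 might look like the sketch below. Upper-casing and trimming the names before grouping is an assumption (the definition does not say how to normalize them), and the function and output key names are hypothetical. NPIs with no spending rows contribute 0 to the combined total.

```python
from collections import defaultdict


def shared_official_flags(nppes_rows, paid_by_npi, min_npis=5, min_total=1_000_000):
    """Flag authorized officials controlling min_npis+ NPIs whose combined paid exceeds min_total."""
    by_official = defaultdict(set)
    for row in nppes_rows:
        last = (row.get("Authorized Official Last Name") or "").strip().upper()
        first = (row.get("Authorized Official First Name") or "").strip().upper()
        if last and first:  # rows without an authorized official are ignored
            by_official[(last, first)].add(row["NPI"])
    flags = []
    for (last, first), npis in by_official.items():
        if len(npis) < min_npis:
            continue
        per_npi = {npi: paid_by_npi.get(npi, 0.0) for npi in sorted(npis)}
        combined = sum(per_npi.values())
        if combined > min_total:
            flags.append({
                "authorized_official": f"{first} {last}",
                "npis": sorted(npis),
                "total_paid_per_npi": per_npi,
                "combined_total_paid": combined,
            })
    return flags
```

Grouping on name alone will merge distinct people who share a common name, so flagged groups are leads to verify rather than confirmed common control.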

Signal 6: Geographic Implausibility

Definition: For each BILLING_PROVIDER_NPI_NUM, get their state from NPPES (Provider Business Practice Location Address State Name). Check if they have TOTAL_CLAIMS > 100 in any single month for HCPCS codes in the home health range (G0151-G0162, G0299-G0300, S9122-S9124, T1019-T1022). If so, compute their TOTAL_UNIQUE_BENEFICIARIES / TOTAL_CLAIMS ratio for those codes. Flag if the ratio is below 0.1 (i.e., fewer than 1 unique patient per 10 claims -- indicating repeated billing on the same patients)
Output: The NPI, state, the flagged HCPCS codes, the month, claims count, unique beneficiaries, the ratio
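The HCPCS range test and the ratio for Signal 6 can be sketched as below. The sketch aggregates home health rows per month before applying the thresholds, which is one reading of the definition; summing TOTAL_UNIQUE_BENEFICIARIES across codes may double-count patients who appear under several codes. Function and key names are hypothetical.

```python
HOME_HEALTH_RANGES = [
    ("G0151", "G0162"),
    ("G0299", "G0300"),
    ("S9122", "S9124"),
    ("T1019", "T1022"),
]


def is_home_health(code):
    """True when a 5-character HCPCS code falls in one of the listed ranges."""
    code = (code or "").strip().upper()
    if len(code) != 5 or not code[1:].isdigit():
        return False
    # Fixed-width letter+digits codes compare correctly as plain strings.
    return any(low <= code <= high for low, high in HOME_HEALTH_RANGES)


def home_health_flags(rows, min_claims=100, max_ratio=0.1):
    """rows: iterable of (month, hcpcs, unique_beneficiaries, claims) for one billing NPI."""
    by_month = {}
    for month, hcpcs, beneficiaries, claims in rows:
        if not is_home_health(hcpcs):
            continue
        entry = by_month.setdefault(month, {"codes": set(), "beneficiaries": 0, "claims": 0})
        entry["codes"].add(hcpcs.strip().upper())
        entry["beneficiaries"] += beneficiaries
        entry["claims"] += claims
    flags = []
    for month in sorted(by_month):
        entry = by_month[month]
        if entry["claims"] <= min_claims:
            continue
        ratio = entry["beneficiaries"] / entry["claims"]
        if ratio < max_ratio:
            flags.append({
                "month": month,
                "hcpcs_codes": sorted(entry["codes"]),
                "claims": entry["claims"],
                "unique_beneficiaries": entry["beneficiaries"],
                "ratio": ratio,
            })
    return flags
```

For instance, 8 unique beneficiaries over 150 claims under T1019 in one month gives a ratio of about 0.053 and is flagged, while 30 beneficiaries over 120 claims (0.25) is not.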

REQUIRED OUTPUT FORMAT

The tool must produce a single file: fraud_signals.json

{
  "generated_at": "2026-02-18T12:00:00Z",
  "tool_version": "string",
  "total_providers_scanned": 0,
  "total_providers_flagged": 0,
  "signal_counts": {
    "excluded_provider": 0,
    "billing_outlier": 0,
    "rapid_escalation": 0,
    "workforce_impossibility": 0,
    "shared_official": 0,
    "geographic_implausibility": 0
  },
  "flagged_providers": [
    {
      "npi": "1234567890",
      "provider_name": "string (from NPPES)",
      "entity_type": "individual|organization",
      "taxonomy_code": "string",
      "state": "XX",
      "enumeration_date": "YYYY-MM-DD",
      "total_paid_all_time": 0.00,
      "total_claims_all_time": 0,
      "total_unique_beneficiaries_all_time": 0,
      "signals": [
        {
          "signal_type": "excluded_provider|billing_outlier|rapid_escalation|workforce_impossibility|shared_official|geographic_implausibility",
          "severity": "critical|high|medium",
          "evidence": {}
        }
      ],
      "estimated_overpayment_usd": 0.00,
      "fca_relevance": {
        "claim_type": "string describing the FCA violation pattern",
        "statute_reference": "31 U.S.C. section 3729(a)(1)(A|B|C|G)",
        "suggested_next_steps": ["string"]
      }
    }
  ]
}
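As an illustration of how the top-level fields of this schema could be assembled, the sketch below derives `signal_counts` and `total_providers_flagged` from the list of flagged providers. Counting one per signal entry is an assumption about what `signal_counts` means, and `build_report` is a hypothetical helper name.

```python
import json
from collections import Counter
from datetime import datetime, timezone

SIGNAL_TYPES = [
    "excluded_provider",
    "billing_outlier",
    "rapid_escalation",
    "workforce_impossibility",
    "shared_official",
    "geographic_implausibility",
]


def build_report(flagged_providers, total_scanned, tool_version):
    """Assemble the top-level report dict from already-built provider entries."""
    counts = Counter(
        signal["signal_type"]
        for provider in flagged_providers
        for signal in provider["signals"]
    )
    return {
        "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tool_version": tool_version,
        "total_providers_scanned": total_scanned,
        "total_providers_flagged": len(flagged_providers),
        # Every signal type appears in the output, even with a zero count.
        "signal_counts": {name: counts.get(name, 0) for name in SIGNAL_TYPES},
        "flagged_providers": list(flagged_providers),
    }


def write_report(report, path):
    """Serialize the report as indented JSON at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```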

Severity rules (no discretion):

critical: Signal 1 (excluded provider), always
high: Signal 2 with ratio > 5x peer median, Signal 3 with growth > 500%, Signal 4 (always), Signal 5 with combined total > $5M
medium: All other flags
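Since the severity rules leave no discretion, they reduce to a small lookup function. The sketch below is one way to write it; the evidence key names (`ratio`, `peak_growth_pct`, `combined_total_paid`) are hypothetical and would need to match whatever the signal code actually emits.

```python
def severity(signal_type, evidence):
    """Map a signal and its evidence dict to critical, high, or medium."""
    if signal_type == "excluded_provider":
        return "critical"
    if signal_type == "workforce_impossibility":
        return "high"
    if signal_type == "billing_outlier" and (evidence.get("ratio") or 0) > 5:
        return "high"
    if signal_type == "rapid_escalation" and (evidence.get("peak_growth_pct") or 0) > 500:
        return "high"
    if signal_type == "shared_official" and (evidence.get("combined_total_paid") or 0) > 5_000_000:
        return "high"
    return "medium"
```

Missing or null evidence values fall through to "medium" rather than raising, so a billing outlier whose peer median is zero (and whose ratio is therefore undefined) still gets a severity.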

estimated_overpayment_usd calculation:

Signal 1: total paid after exclusion date
Signal 2: (provider_total - peer_99th_percentile), floored at 0
Signal 3: total paid in months where growth exceeded 200%
Signal 4: (peak_month_claims - (6 * 8 * 22)) * (peak_month_total_paid / peak_month_claims), floored at 0
Signal 5: not estimated (set to 0)
Signal 6: not estimated (set to 0)
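The formulas above translate fairly directly into code. In the sketch below the evidence key names are hypothetical, and for Signal 3 "growth" is read as the month-over-month rate for each month, which is an assumption because the rule does not say whether it means that or the rolling average.

```python
CLAIMS_CAPACITY = 6 * 8 * 22  # 1,056 claims per month at the Signal 4 threshold


def estimated_overpayment(signal_type, evidence):
    """Return the estimated overpayment in USD for one signal."""
    if signal_type == "excluded_provider":
        return evidence["paid_after_exclusion"]
    if signal_type == "billing_outlier":
        return max(0.0, evidence["total_paid"] - evidence["peer_99th_percentile"])
    if signal_type == "rapid_escalation":
        # evidence["monthly"] holds (paid, growth_pct) pairs; growth_pct is None for month 1.
        return sum(paid for paid, growth in evidence["monthly"]
                   if growth is not None and growth > 200)
    if signal_type == "workforce_impossibility":
        excess_claims = evidence["peak_claims"] - CLAIMS_CAPACITY
        paid_per_claim = evidence["peak_month_total_paid"] / evidence["peak_claims"]
        return max(0.0, excess_claims * paid_per_claim)
    return 0.0  # Signals 5 and 6 are not estimated
```

As a worked example for Signal 4, a peak month of 1,760 claims paid at 88,000 USD gives (1760 - 1056) * 50 = 35,200 USD.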

statute_reference mapping:

Signal 1: 31 U.S.C. section 3729(a)(1)(A) (presenting false claims -- excluded provider cannot bill)
Signal 2: 31 U.S.C. section 3729(a)(1)(A) (potential overbilling)
Signal 3: 31 U.S.C. section 3729(a)(1)(A) (potential bust-out scheme)
Signal 4: 31 U.S.C. section 3729(a)(1)(B) (false records -- impossible volume implies fabricated claims)
Signal 5: 31 U.S.C. section 3729(a)(1)(C) (conspiracy -- coordinated billing through shell entities)
Signal 6: 31 U.S.C. section 3729(a)(1)(G) (reverse false claims -- repeated billing on the same patients)
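The mapping above is fixed, so it can live in a single dictionary keyed by `signal_type`. The helper below is a hypothetical sketch that also enforces the judging requirement of at least two suggested next steps; the `claim_type` text and the steps themselves are supplied by the caller.

```python
STATUTE_BY_SIGNAL = {
    "excluded_provider": "31 U.S.C. section 3729(a)(1)(A)",
    "billing_outlier": "31 U.S.C. section 3729(a)(1)(A)",
    "rapid_escalation": "31 U.S.C. section 3729(a)(1)(A)",
    "workforce_impossibility": "31 U.S.C. section 3729(a)(1)(B)",
    "shared_official": "31 U.S.C. section 3729(a)(1)(C)",
    "geographic_implausibility": "31 U.S.C. section 3729(a)(1)(G)",
}


def fca_relevance(signal_type, claim_type, next_steps):
    """Build the fca_relevance object for one flagged signal."""
    if len(next_steps) < 2:
        raise ValueError("at least two suggested next steps are required")
    return {
        "claim_type": claim_type,
        "statute_reference": STATUTE_BY_SIGNAL[signal_type],
        "suggested_next_steps": list(next_steps),
    }
```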

HARDWARE REQUIREMENTS

Primary target: Single Linux machine with <=200GB RAM, <=1 GPU (H100/H200 class)

Required fallback: Must also run on a MacBook (Apple Silicon, 16GB+ RAM) with no GPU. Performance will be slower but the tool must complete without error.

Constraints:

No distributed computing frameworks required (no Spark cluster, no multi-node Dask)
No proprietary cloud services required (no BigQuery, no Snowflake, no Databricks)
GPU is optional -- if the solution uses GPU (e.g., RAPIDS/cuDF), it must provide a CPU-only fallback via a --no-gpu flag that works on both Linux and macOS
All dependencies pip-installable from public repositories (no conda-only packages)
Must work on both Ubuntu 22.04+ and macOS 14+ with Python 3.11+
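A command-line entry point that satisfies the `--no-gpu` constraint and avoids hardcoded paths could start from the sketch below. Only `--no-gpu` comes from the requirements; the `--data-dir` and `--output` options and their defaults are hypothetical choices for illustration.

```python
import argparse
from pathlib import Path


def build_parser():
    """Create the CLI parser; all paths are relative by default and overridable."""
    parser = argparse.ArgumentParser(
        description="Medicaid provider fraud signal detection"
    )
    parser.add_argument(
        "--data-dir", type=Path, default=Path("data"),
        help="directory holding the three input files (hypothetical option)",
    )
    parser.add_argument(
        "--output", type=Path, default=Path("fraud_signals.json"),
        help="where to write the JSON report (hypothetical option)",
    )
    parser.add_argument(
        "--no-gpu", action="store_true",
        help="force the CPU-only code path",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"data_dir={args.data_dir} output={args.output} no_gpu={args.no_gpu}")
```

Defaulting to the CPU path whenever a GPU library fails to import, in addition to honoring the flag, would let the same command run unchanged on the MacBook fallback.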

Performance criteria:

Linux (200GB RAM, GPU available): completes full run in under 30 minutes
Linux (64GB RAM, no GPU): completes full run in under 60 minutes
MacBook (16GB RAM, Apple Silicon, no GPU): completes full run in under 4 hours

REQUIRED DELIVERABLE STRUCTURE

submission/
  README.md              -- Setup + run instructions
  requirements.txt       -- Python dependencies (pip installable)
  setup.sh               -- One command: downloads data, installs deps
  run.sh                 -- One command: produces fraud_signals.json
  src/
    ingest.py            -- Data loading + joining
    signals.py           -- All 6 signal implementations
    output.py            -- JSON report generation
  tests/
    test_signals.py      -- Unit tests for each signal with synthetic data
    fixtures/            -- Small synthetic test datasets
  fraud_signals.json     -- Sample output from actual data run

JUDGING CRITERIA (scored by AI judge, all binary or numeric -- 100 points total)

Functional (60 points)

setup.sh completes without error on Ubuntu 22.04 with Python 3.11+ (5 pts)
setup.sh completes without error on macOS 14+ Apple Silicon with Python 3.11+ (5 pts)
run.sh produces fraud_signals.json that validates against the schema above (10 pts)
Signal 1 implemented and produces >=1 result (verified against LEIE data) (10 pts)
Signal 2 implemented and produces results with correct percentile math (5 pts)
Signal 3 implemented and produces results with correct growth rate math (5 pts)
Signal 4 implemented and produces results with correct threshold math (5 pts)
Signal 5 implemented and produces results from NPPES authorized official data (5 pts)
Signal 6 implemented and produces results for home health HCPCS codes (5 pts)
All estimated_overpayment_usd values follow the formulas above (5 pts)

Testing (15 points)

pytest tests/ passes with >=6 tests (one per signal) (10 pts)
Test fixtures contain synthetic data that triggers each signal (5 pts)

Legal Usability (15 points)

Every flagged provider has all required JSON fields populated (non-null) (5 pts)
statute_reference correctly maps per the statute_reference mapping above (5 pts)
suggested_next_steps contains >=2 specific steps per flag (e.g., "Verify provider at [address] is operational", "Request claims detail for HCPCS [code] for [month]") (5 pts)

Code Quality (10 points)

No hardcoded file paths (all paths relative or configurable) (3 pts)
Handles missing/null NPI values in LEIE without crashing (3 pts)
Completes full run in under 60 minutes with 64GB RAM, no GPU (4 pts)

WHAT IS NOT SCORED (to prevent ambiguity):

Code style, comments, docstrings -- not scored
Number of providers flagged -- not scored (quality of signals matters, not volume)
UI, visualization, dashboards -- not scored
Support for formats other than Parquet -- not scored

Competition: Medicaid Provider Fraud Signal Detection Engine — 1000N Prize

Description