Rainbow Supermarket · POC 2025 · Store 常兴天虹00110

SKU-Level Demand Intelligence

An ensemble of XGBoost, LightGBM, and Random Forest forecasting daily sales quantity for 20,000+ SKUs across five product categories, trained on 2+ years of transaction history with 24 base features (+6 auto-discovered) including promotions.


20,372

Eligible SKUs

5

Forecasting Models

Three machine-learning models — two gradient boosting, one bagging ensemble — plus a seasonal-naïve baseline. The tree models share the same 24 base + 6 auto-discovered features and use recursive multi-step forecasting over the 46-day horizon. Final submission uses the mean of the three.
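Recursive multi-step forecasting can be sketched as below. This is a minimal illustration, not the POC code: `recursive_forecast`, `one_step_model`, and the toy `mean_of_last_7` model are hypothetical names, and a real run would rebuild the full feature row (lags, rolling means, calendar, promotions) at every step.

```python
from typing import Callable, List, Sequence


def recursive_forecast(
    history: Sequence[float],
    one_step_model: Callable[[Sequence[float]], float],
    horizon: int,
) -> List[float]:
    """Forecast `horizon` days by feeding each prediction back as history."""
    series = list(history)
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step_model(series)  # predict the next day only
        forecasts.append(y_hat)
        series.append(y_hat)  # the prediction becomes the newest lag
    return forecasts


def mean_of_last_7(series: Sequence[float]) -> float:
    """Toy stand-in for a trained model (hypothetical)."""
    window = series[-7:]
    return sum(window) / len(window)


# 46 steps, matching the horizon quoted above
preds = recursive_forecast([3, 4, 5, 4, 3, 6, 7], mean_of_last_7, horizon=46)
```

Because later steps consume earlier predictions rather than actuals, errors can compound over the horizon, which is why scores obtained with true lags may not carry over to recursive evaluation.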

LightGBM Global · All SKUs

Gradient boosted trees with leaf-wise growth and Gradient-based One-Side Sampling (GOSS). At each round a new tree fits the negative gradient of the loss, iteratively correcting the errors of the previous trees. Fast on large datasets with sparse features.

ŷ = Σ_k η · f_k(x) (η=0.05, K=200)

Gain = ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ

200 trees · depth 6 · 63 leaves · GOSS sampling
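The split-gain formula above can be evaluated directly. A minimal sketch, assuming G and H are the sums of first- and second-order gradients over the rows in each child; the function name and the default values of `lam` and `gamma` are illustrative, not the POC settings.

```python
def split_gain(g_left: float, h_left: float,
               g_right: float, h_right: float,
               lam: float = 1.0, gamma: float = 0.0) -> float:
    """Gain of splitting one node into a left and a right child.

    lam (λ) and gamma (γ) defaults are assumptions for illustration.
    """
    def score(g: float, h: float) -> float:
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# Children whose gradients point in opposite directions give a large gain
gain = split_gain(-4.0, 2.0, 4.0, 2.0, lam=0.0)
```

A split is only worth making when the gain is positive, so a larger γ prunes more aggressively.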

XGBoost Global · All SKUs

Gradient boosted trees with level-wise growth and a second-order Taylor approximation of the loss. Regularisation via L1/L2 penalties on leaf weights helps limit overfitting on noisy SKU-level signals.

obj = Σ_i [g_i · f(x_i) + ½ · h_i · f(x_i)²] + Ω(f)

Ω(f) = γT + ½λ‖w‖² (L2 on leaf weights)

200 trees · depth 6 · subsample 0.8 · hist method
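For a single leaf with weight w, the objective above reduces to G·w + ½(H + λ)·w², where G and H are the gradient sums over the rows in that leaf; its minimiser is w* = −G / (H + λ). The sketch below checks this; function names are illustrative, and only the L2 term shown in Ω(f) is modelled.

```python
def leaf_objective(w: float, g_sum: float, h_sum: float, lam: float) -> float:
    """Second-order objective of one leaf (the γ term is constant per leaf)."""
    return g_sum * w + 0.5 * (h_sum + lam) * w * w


def optimal_leaf_weight(g_sum: float, h_sum: float, lam: float) -> float:
    """Closed-form minimiser of leaf_objective: w* = -G / (H + lam)."""
    return -g_sum / (h_sum + lam)


w_star = optimal_leaf_weight(g_sum=-4.0, h_sum=2.0, lam=0.0)
best = leaf_objective(w_star, g_sum=-4.0, h_sum=2.0, lam=0.0)
```

Substituting w* back gives −½·G²/(H + λ) per leaf, which is where the G²/(H + λ) terms in the split-gain formula come from.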

Random Forest Global · All SKUs

Bagging ensemble: each tree is trained on a bootstrap sample with random feature subsets, then predictions are averaged. Uncorrelated errors across trees reduce variance as 1/B.

ŷ = (1/B) Σ_b T_b(x) (B=100)

Var(ŷ) = ρσ² + ((1−ρ)/B)σ²

ρ = inter-tree correlation; the second term → 0 as B → ∞, leaving ρσ² as the floor

100 trees · depth 10 · 50% row sample · 70% col sample
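The variance formula shows why adding trees has diminishing returns. A small sketch; the correlation value 0.3 is an arbitrary assumption for illustration, not a measured figure.

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the mean of n_trees predictors with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


# rho = 0.3 is an assumed value; B = 100 matches the configuration above
var_100 = ensemble_variance(rho=0.3, sigma2=1.0, n_trees=100)
var_uncorrelated = ensemble_variance(rho=0.0, sigma2=1.0, n_trees=100)
```

With ρ = 0 the variance falls as 1/B; with any positive ρ it cannot drop below ρσ², which is why row and column subsampling (to decorrelate trees) matters more than the tree count.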

Seasonal Naïve Baseline · time series

Fast day-of-week baseline. For each forecast day it blends the most recent value on that weekday with the 4-week day-of-week mean, capturing weekly seasonality with zero training. It serves as the benchmark the ML models must beat.

ŷ_t = ½ · y_last(dow) + ½ · mean_4wk(dow)

dow = day-of-week of target date t

mean over the prior 28-day window

period m = 7 · no training · benchmark
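The baseline formula can be sketched as follows. This assumes the 28-day window is the last 28 observed days and that a weekday with no history forecasts zero; both choices, and the function name, are assumptions rather than the POC implementation.

```python
from datetime import date, timedelta
from typing import Dict


def seasonal_naive(history: Dict[date, float], target: date) -> float:
    """Blend the latest same-weekday value with the 4-week same-weekday mean."""
    last_day = max(history)
    window_start = last_day - timedelta(days=27)  # last 28 observed days
    same_dow = [
        qty for day, qty in sorted(history.items())
        if day >= window_start and day.weekday() == target.weekday()
    ]
    if not same_dow:
        return 0.0  # assumption: no same-weekday history means zero demand
    y_last = same_dow[-1]
    mean_4wk = sum(same_dow) / len(same_dow)
    return 0.5 * y_last + 0.5 * mean_4wk


# Toy history: 28 days where every day of week k sells k units (k = 0..3)
start = date(2025, 6, 3)
history = {start + timedelta(days=i): float(i // 7) for i in range(28)}
forecast = seasonal_naive(history, start + timedelta(days=28))
```

In the toy history each weekday has the values 0, 1, 2, 3, so the forecast is ½·3 + ½·1.5 = 2.25.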

Ensemble LGB + XGB + RF

Simple mean of the three tree models — LightGBM, XGBoost, and Random Forest. Mixing two boosting variants with bagging diversifies error sources — when errors are uncorrelated, ensemble variance ∝ 1/K.

ŷ_ens = ⅓(ŷ_LGB + ŷ_XGB + ŷ_RF)

Var(ŷ_ens) ≈ (1/K) · σ² (K=3)

holds when model errors are uncorrelated

simple mean · K = 3 models · POC submission

Feature Inputs by Model

Tree models use 24 hand-crafted base features plus 6 auto-discovered features (from a 10,000-feature FunSearch pool) = 30. The seasonal-naïve baseline uses only the raw time series.

Feature group	LightGBM	XGBoost	Rand. Forest	Seas. Naïve
Lag features (lag_1–28) · 7	✓	✓	✓	—
Rolling means (roll7/14/28) · 3	✓	✓	✓	—
Calendar (dow, dom, month) · 3	✓	✓	✓	implicit
Promotion flags · 8	✓	✓	✓	—
Category signal (cat_lag1/roll7) · 2	✓	✓	✓	—
SKU identity (sku_id) · 1	✓	✓	✓	—
Auto-discovered (FunSearch) · 6 from 10,000 candidates	✓	✓	✓	—
Total features	30	30	30	1 (series)

24 hand-crafted base features

Lag · 7

lag_1 lag_2 lag_3
lag_7 lag_14 lag_21 lag_28

Rolling · 3

roll7 · roll14 · roll28
(shift-1, no leakage)

Calendar · 3 + id

day_of_week · day_of_month
month · sku_id

Promo · 8 + cat · 2

is_promo · discount_depth
is_bundle · is_threshold
is_warehouse · is_online
days_since_promo · roll_promo_7
cat_lag1 · cat_roll7
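The lag and rolling groups listed above can be built so that the target day never leaks into its own features. A minimal sketch over a plain list of daily sales; the function name, the `None` fill for missing history, and averaging over partial windows are assumptions.

```python
from typing import Dict, List, Optional

LAGS = (1, 2, 3, 7, 14, 21, 28)
WINDOWS = (7, 14, 28)


def lag_and_rolling(sales: List[float], t: int) -> Dict[str, Optional[float]]:
    """Features for day index t using only sales[:t] (days strictly before t)."""
    feats: Dict[str, Optional[float]] = {}
    for k in LAGS:
        # None when the series is too short (assumption: no imputation here)
        feats[f"lag_{k}"] = sales[t - k] if t - k >= 0 else None
    for w in WINDOWS:
        window = sales[max(0, t - w):t]  # shift-1: sales[t] is excluded
        feats[f"roll{w}"] = sum(window) / len(window) if window else None
    return feats


daily = [float(i) for i in range(30)]
feats = lag_and_rolling(daily, t=28)
```

Changing `daily[28]` leaves `feats` untouched, which is the "shift-1, no leakage" property.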

+ 6 auto-discovered features (FunSearch / LLM)

roll7−roll28 · lag7−cat_lag1 · roll7−cat_roll7 · lag1/std(lags) · lag7/lag28 · roll7/lag1

Selected from a 10,000-feature candidate pool by genetic search, Kimi LLM, and feature-level GP.

Walk-Forward CV Results · FA metric on sales quantity

Model	CV FA	8802	8803	8804	8805	8807
LightGBM	—	—	—	—	—	—
XGBoost	—	—	—	—	—	—
Random Forest	—	—	—	—	—	—
Seasonal Naïve (baseline)	—	—	—	—	—	—
Ensemble (LGB+XGB+RF)	—	—	—	—	—	—

FunSearch · 2,000-SKU study

🤖 LLM Auto Feature Engineering

Can a machine discover better features than the hand-crafted 24? We ran a FunSearch study on a 2,000-SKU sample with a library of up to 10,000 auto-generated candidate features, searched three ways: a genetic algorithm over feature-sets, an LLM-in-the-loop sampler (Kimi), and a feature-level genetic program that kills weak features and breeds survivors via crossover & mutation.

10,000 · candidate features generated

3 · search methods (GA · LLM · feature-GP)

+2.5 pp · best proxy FA lift (LightGBM)

≈ 0 pp · on full 20k production (proxy gain didn't transfer)

LLM auto-FE improvement over each ML model — weekly FA (2,000 SKUs)

The single Kimi-discovered 6-feature set added to base-24, evaluated on every model · per-category & overall FA · 4 validation weeks (2025-06-03 → 06-30), train cutoff 2025-06-02
8802 Category A · 8803 Category B · 8804 Category C · 8805 Hardest · 8807 Category E · Overall = sales-weighted across all five

Model	8802	8803	8804	8805	8807	Overall	Lift
LightGBM	79.1%	81.4%	62.0%	47.0%	63.8%	67.25%	+1.72 pp
XGBoost	74.1%	78.7%	58.5%	45.0%	60.5%	63.62%	+0.71 pp
Random Forest	77.8%	78.6%	55.5%	40.1%	61.9%	63.25%	+0.16 pp
OLS (Ridge)	45.3%	45.4%	44.7%	27.0%	44.3%	42.54%	+26.70 pp

Best feature set written by the LLM

Kimi · +1.72 pp

The exact add_features() the LLM-in-the-loop sampler (Kimi, 30 iterations) produced at its best iteration — a momentum + category-relative + volatility mix it discovered on its own:

feat_weekly_speed = roll7 − roll28

feat_cat_momentum = lag7 − cat_lag1

feat_cat_roll_momentum = roll7 − cat_roll7

feat_lag1_over_vol = lag1 / std(lags)

feat_lag7_over_lag28 = lag7 / lag28

feat_rollmean_over_lag1 = roll7 / lag1
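The six definitions above can be written as a single function over one feature row. This is a reconstruction from the listed formulas, not the LLM's actual output: the `EPS` guard, the choice of a population standard deviation over the seven lag features for `std(lags)`, and the dict-based row are all assumptions.

```python
from statistics import pstdev
from typing import Dict

EPS = 1e-6  # assumption: avoids division by zero on zero-sales days
LAG_KEYS = ("lag_1", "lag_2", "lag_3", "lag_7", "lag_14", "lag_21", "lag_28")


def add_features(row: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `row` with the six derived features appended."""
    lag_std = pstdev([row[k] for k in LAG_KEYS])  # assumed meaning of std(lags)
    out = dict(row)
    out["feat_weekly_speed"] = row["roll7"] - row["roll28"]
    out["feat_cat_momentum"] = row["lag_7"] - row["cat_lag1"]
    out["feat_cat_roll_momentum"] = row["roll7"] - row["cat_roll7"]
    out["feat_lag1_over_vol"] = row["lag_1"] / (lag_std + EPS)
    out["feat_lag7_over_lag28"] = row["lag_7"] / (row["lag_28"] + EPS)
    out["feat_rollmean_over_lag1"] = row["roll7"] / (row["lag_1"] + EPS)
    return out


example = add_features({
    "lag_1": 4.0, "lag_2": 4.0, "lag_3": 4.0, "lag_7": 6.0,
    "lag_14": 4.0, "lag_21": 4.0, "lag_28": 3.0,
    "roll7": 5.0, "roll28": 4.0, "cat_lag1": 10.0, "cat_roll7": 8.0,
})
```

All six are differences or ratios of existing base features, the kind of interaction a tree can approximate with successive splits but a linear model cannot represent at all.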

The independent genetic search over the feature library reached a comparable +2.46 pp with the same feature families (dispersion · recency · trend). Two different methods converging on the same ceiling suggests the limit lies in the problem, not the search.

What we learned

▸ The proxy gain (+2.5 pp on 2,000 SKUs, teacher-forced) did not transfer: integrated into the full 20,372-SKU recursive production CV, the 6 features were ≈ neutral (LightGBM +0.1, RF −0.4 pp).
▸ A bigger library (40 → 946 → 10,000) did not raise the tree ceiling — all three searches converge by generation 3–8.
▸ The feature-level GP (kill weak → crossover/mutate survivors) bred increasingly complex composites — by gen 15, 120/150 elite slots were bred offspring — yet still plateaued at the same ceiling (+1.4 pp).
▸ The real win is for weak/linear models: OLS gains +26.7 pp from the same features because it cannot construct interactions itself.
▸ The recursive lift concentrates in the hardest categories — 8804 (+1.2 pp) and 8805 (+3.2 pp).

Ceiling insight: the symmetric weekly FA on intermittent retail demand is dominated by weekly seasonality the base lags already capture — limiting feature-engineering headroom.

📐 POC Accuracy Metric — FA Formula

FA = 1 − 2|F − A| / (F + A)

F = weekly forecast total · A = weekly actual total (per SKU per store)

Example: F = 99, A = 83

FA = 1 − 2×|99−83| / (99+83)

= 1 − 2×16 / 182

= 1 − 0.176 = 82.4% ✓

① If F or A < 0 → clamp to 0

② If F = A = 0 → FA = 100%

③ FA is clipped to [0%, 100%]
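The formula and its three edge-case rules fit in a few lines. A minimal sketch; the function name is illustrative, and FA is returned as a fraction in [0, 1] rather than a percentage.

```python
def forecast_accuracy(forecast: float, actual: float) -> float:
    """Weekly FA for one SKU: 1 - 2|F - A| / (F + A), with the POC edge rules."""
    f = max(forecast, 0.0)  # rule 1: negative totals are clamped to 0
    a = max(actual, 0.0)
    if f == 0.0 and a == 0.0:
        return 1.0  # rule 2: both zero counts as a perfect forecast
    fa = 1.0 - 2.0 * abs(f - a) / (f + a)
    return min(1.0, max(0.0, fa))  # rule 3: clip to [0, 1]


worked_example = forecast_accuracy(99, 83)  # the F = 99, A = 83 case above
```

Forecasting anything for a week with zero actual sales (or nothing for a week with sales) scores 0 after clipping, so intermittent SKUs are penalised heavily.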

Why this formula? Rewrite it as 1 − |F−A| / ((F+A)/2). The denominator is the mean of forecast and actual, so the error term is a symmetric relative deviation. Unlike regular MAPE, which divides by the actual alone, it treats F and A interchangeably: swapping them leaves the score unchanged. For non-negative totals the raw value lies in [−1, 1]; after clipping to [0, 1], a score of 0 means the forecast and actual differ by a factor of 3 or more.

Weighted aggregation (3 steps):

1. Sum daily predictions and actuals to weekly totals per SKU (weeks: Jul 1–7, 8–14, 15–21, 22–28)

2. Each (SKU, week) weight = actual sales in prior 28 days ending the day before the evaluation week starts

3. Weighted mean FA → by category (8802–8807) → overall

Agg FA = Σ(FA_sku,wk × w_sku,wk) / Σw_sku,wk

w_sku,wk = SKU sales in 28 days before week starts
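The aggregation step can be sketched as a weighted mean over (SKU, week) rows. This assumes weekly totals and prior-28-day weights are already computed; the function names, the tuple layout, and returning NaN when all weights are zero are assumptions.

```python
from typing import Iterable, Tuple


def forecast_accuracy(forecast: float, actual: float) -> float:
    """Per-(SKU, week) FA with clamping, the zero-zero rule, and clipping."""
    f, a = max(forecast, 0.0), max(actual, 0.0)
    if f == 0.0 and a == 0.0:
        return 1.0
    return min(1.0, max(0.0, 1.0 - 2.0 * abs(f - a) / (f + a)))


def weighted_fa(rows: Iterable[Tuple[float, float, float]]) -> float:
    """rows: (weekly forecast total, weekly actual total, prior-28-day sales)."""
    num = 0.0
    den = 0.0
    for forecast, actual, weight in rows:
        num += forecast_accuracy(forecast, actual) * weight
        den += weight
    # assumption: with no sales weight at all, the aggregate is undefined
    return num / den if den > 0 else float("nan")


# Two (SKU, week) rows: a high-volume SKU forecast well, a low-volume one missed
agg = weighted_fa([(99.0, 83.0, 300.0), (30.0, 10.0, 100.0)])
```

Because the weights are prior sales, a miss on a slow-moving SKU moves the aggregate far less than the same miss on a best-seller. Filtering rows by category before calling `weighted_fa` gives the per-category figures.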

POC categories (store 常兴天虹00110)

8802 · Category A
8803 · Category B
8804 · Category C
8805 · Hardest
8807 · Category E

Prediction Explorer

Upload the forecast submission file or load demo data to visualise daily SKU-level forecasts across all five models.


Start uvicorn api_server:app --port 8000 for live multi-model predictions

1 Load Forecast Data (upload XLSX/CSV or use demo)

📂

Upload poc_submission_v2.xlsx / .csv

Columns: 日期 (date) · 条码 (barcode) · 预测销量 (forecast sales quantity)


Accuracy Evaluator

Upload your actual sales file. The FA metric is computed in-browser using the exact POC formula — no data leaves your machine.

⚠️ Load forecast data in the Prediction Explorer first.

1 Upload Actual Sales

Auto-detected columns: date (日期 / date) · barcode (条码 / 条形码 / barcode) · quantity (当天全部销售数量 "total quantity sold that day" / 销量 "sales quantity" / quantity)

📊

Upload actual sales file

CSV · XLSX · TXT

Overall Weighted FA

--

Weekly FA Breakdown (prior-4-week weighted)

Period	Weighted FA	SKUs	Forecast Σ	Actual Σ	Bias

