Reward Functions in RL for Algorithmic Trading

Comprehensive Research Synthesis


Key Findings

Reward function design is the dominant driver of whether RL trading agents converge to economically meaningful behavior. The empirical evidence from NeurIPS, IEEE TNNLS, AAAI, CIKM, and open-source frameworks (FinRL) converges on a practical "reward ladder":

  1. Start with a dense, cost-aware return proxy (log-return or PnL delta after realistic costs)
  2. Scale/normalize for stable learning (reward_scaling, log domain)
  3. Add one risk term (differential Sharpe, drawdown penalty, or auxiliary risk task)
  4. Validate under strict walk-forward protocol with turnover/capacity constraints
  5. Advanced: Layer hierarchical modules, train-time shaping, or self-rewarding mechanisms
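
As a concrete illustration, steps 1–3 of the ladder can be combined into a single per-step reward. This is a minimal sketch, not any cited paper's exact formulation; the function name `ladder_reward` and the default values for `fee_rate`, `reward_scaling`, and `dd_weight` are illustrative assumptions:

```python
import math

def ladder_reward(v_prev, v_curr, turnover, peak_value,
                  fee_rate=0.001, reward_scaling=100.0, dd_weight=0.5):
    """Dense, cost-aware per-step reward following steps 1-3 of the ladder.

    v_prev, v_curr : portfolio value before / after the step
    turnover       : fraction of the portfolio traded this step
    peak_value     : running high-water mark of portfolio value
    """
    # Step 1: dense return proxy -- log-return net of proportional costs
    net_return = math.log(v_curr / v_prev) - fee_rate * turnover
    # Step 3: one risk term -- penalize current drawdown from the peak
    drawdown = max(0.0, 1.0 - v_curr / max(peak_value, v_curr))
    # Step 2: scale for stable gradients (FinRL-style reward_scaling)
    return reward_scaling * (net_return - dd_weight * drawdown)
```

In practice the drawdown weight and scaling factor would be tuned per market and algorithm; the scaling factor plays the same gradient-stabilization role as FinRL's `reward_scaling`.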

Comprehensive Reward Function Summary Table

| # | Reward Family | Mathematical Form | Mechanism | Risk Term | Cost Model | Empirical Evidence |
|---|---------------|-------------------|-----------|-----------|------------|--------------------|
| 1 | PnL Delta + Turnover Penalty (Deng et al., IEEE TNNLS 2017) | r_t = a_{t−1}·Δp_t − δ·\|a_t − a_{t−1}\| | Dense per-step profit minus position-flip cost; a ∈ {−1, 0, 1} | Optional SR objective variant | Proportional via δ on position change | Futures (IF/AG/SU); SR objective recommended for reliability |
| 2 | Portfolio Value Delta + Scaling (FinRL StockTradingEnv) | r_t = (V_t − V_{t−1}) × reward_scaling | Per-step total-asset change with explicit scaling for gradient stability | Turbulence-threshold gating; forced liquidation | Proportional buy/sell cost rates | Gym-style env; widely adopted baseline |
| 3 | Log-Return + Remainder Factor (Jiang et al., 2017) | r_t = ln(V_t / V_{t−1}) with remainder factor μ | Log-domain compression; dense per-period | None in reward; Sharpe & MDD ex post | Proportional commission via μ; zero slippage assumed | Poloniex crypto; 0.25% commission; reports fAPV, Sharpe, MDD |
| 4 | Differential Sharpe Ratio (Moody & Saffell, NeurIPS 1998) | D_t = (B·ΔA − 0.5·A·ΔB) / (B − A²)^{3/2} | Online incremental approximation of the global Sharpe ratio | Yes: Sharpe by construction | δ integrated into return | Monthly S&P 500 (1950–94); Sharpe > buy-and-hold at 0.5% cost |
| 5 | Negative Max Drawdown (DeepTrader, AAAI 2021) | r_risk = −MDD (risk-scaling module) | Hierarchical: return for asset scoring; −MDD for risk scaling | Yes: drawdown as first-class objective | Unspecified; shorting constraints | DJIA/HSI/CSI100; DT-MDD materially lower MDD than DT-RoR |
| 6 | Hindsight Bonus Shaping (DeepScalper, CIKM 2022) | r_train = r_base + λ·Σ_k r_{t+k} | Train-time shaping; eval uses base reward only | Auxiliary volatility-prediction task | Fee rates per futures contract | 1-min OHLCV + LOB; 6 futures; best on TR%, Sharpe, Calmar |
| 7 | Implementation Shortfall (Hendricks & Wilcox, 2014) | IS = execution cost vs. arrival-price benchmark | Minimizes execution cost for optimal liquidation | Not in reward; variance increase noted | LOB depth traversal; Almgren–Chriss | SA equities, 5-min; ~10% IS improvement for short horizons |
| 8 | Sortino Ratio | SR = E[r − r_f] / σ_downside | Penalizes only downside deviation | Yes: asymmetric downside focus | Varies | Outperforms Sharpe in extended action spaces; +856.7% with 21 actions |
| 9 | Calmar Ratio | CR = annualized return / max drawdown | Step-wise normalization via log-return in Actor–Critic | Yes: drawdown denominator | Varies | A2C achieves the highest Calmar (5.16) among PPO/DQN |
| 10 | PnL − Inventory Penalty (Market Making) | r = PnL − φ·q² | Balances spread capture vs. inventory risk | Yes: quadratic inventory penalty | LOB execution; AIIF tuning | Simulated LOB; Sharpe 31.54 in 60 episodes |
| 11 | Self-Rewarding DRL (SRDRL, 2024) | r = max(r_expert, r_predicted) | Secondary network generates an adaptive reward | Embedded in expert extraction | Varies | SRDDQN NASDAQ: 1124.23% vs. 51.87% static baseline |
| 12 | Composite Multi-Objective | R = w₁·ret + w₂·entropy + w₃·corr + … | Dynamic Lagrangian weighting | Multiple risk axes | L1 turnover + L2 slippage | Improves stability; natural regularization |
| 13 | Factor-Beta Alignment (FDRL) | r = α·Σ (β_p − β_bm)², weighted | Anchors agent to macro factor exposures | Factor-deviation penalty | Varies | Equities: Sharpe 1.04 → 1.27; MDD ~18% |
| 14 | Regret Minimization | r = −(perf_expert − perf_agent) | Minimizes gap vs. benchmark | Implicit via relative performance | Varies | Better OOS; more robust across regimes |
| 15 | ESG-Adjusted Ratios | Adjusted ΔSR/ΔSortino with ESG weighting | Non-financial ESG factors in risk-adjusted reward | Sharpe/Sortino + ESG | Varies | Socially responsible portfolio optimization |
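
Row 4's differential Sharpe ratio lends itself to a compact online implementation. The sketch below follows the D_t formula given in the table, with A and B as exponential moving estimates of the first and second moments of returns; the class name and the default adaptation rate η = 0.01 are illustrative assumptions, not from the original paper:

```python
class DifferentialSharpe:
    """Online differential Sharpe ratio (Moody & Saffell, 1998), sketched.

    A and B track exponential moving estimates of E[r] and E[r^2];
    eta is the adaptation rate of those estimates.
    """
    def __init__(self, eta=0.01):
        self.eta = eta
        self.A = 0.0   # first-moment estimate
        self.B = 0.0   # second-moment estimate

    def update(self, r):
        """Return D_t for the new per-step return r, then adapt A and B."""
        dA = r - self.A
        dB = r * r - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        # D_t = (B*dA - 0.5*A*dB) / (B - A^2)^{3/2}; guard the cold start
        d = 0.0 if denom <= 0.0 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += self.eta * dA
        self.B += self.eta * dB
        return d
```

Used as the per-step reward, `update(r)` gives the agent a dense signal that approximates the gradient of the global Sharpe ratio, which is what makes the metric trainable online.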

Algorithm Performance Comparison (Risk-Adjusted Rewards)

| Algorithm | Cumulative Return | Sharpe | Sortino | Calmar | Key Strength |
|-----------|-------------------|--------|---------|--------|--------------|
| PPO | ~62% (highest) | 1.24 | 3.33 | Moderate | Best raw returns; robust exploration via clipping |
| A2C | ~42% | 2.11 | 5.47 | 5.16 | Superior risk-adjusted metrics; best for Calmar/Sortino |
| DQN | Lowest | 1.82 | 3.02 | Moderate | Highest Omega (55.31) & Profit Factor (4.02); best for discrete actions |
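
Metrics like those in the comparison can be computed from a per-period return series with standard formulas. A minimal sketch, assuming a zero risk-free rate and 252 trading periods per year (both assumptions for illustration, not from the source):

```python
import numpy as np

def risk_adjusted_metrics(returns, periods_per_year=252):
    """Sharpe, Sortino, and Calmar ratios from simple per-period returns."""
    r = np.asarray(returns, dtype=float)
    ann = np.sqrt(periods_per_year)
    sharpe = ann * r.mean() / r.std()
    # Sortino: divide by downside deviation only (NaN if no losing periods)
    sortino = ann * r.mean() / r[r < 0].std()
    # Calmar: annualized return over maximum drawdown of the equity curve
    equity = np.cumprod(1.0 + r)
    peak = np.maximum.accumulate(equity)
    max_drawdown = ((peak - equity) / peak).max()
    ann_return = equity[-1] ** (periods_per_year / len(r)) - 1.0
    calmar = ann_return / max_drawdown
    return sharpe, sortino, calmar
```

Because the downside deviation is at most the full standard deviation, Sortino is at least as large as Sharpe on the same series, which is consistent with the ordering in the table.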

Critical Design Decision Flowchart

Define Trading Task (portfolio / directional / execution)

├─→ Choose Base Reward
│ ├─ Portfolio/Directional → ΔPnL or Log-Return
│ └─ Execution → −Implementation Shortfall

├─→ Add Trading Frictions
│ ├─ Proportional fees + spread + slippage
│ └─ LOB depth / market impact (execution)

├─→ Add Risk Preference (pick ONE)
│ ├─ Differential Sharpe Ratio
│ ├─ Drawdown penalty / −MDD
│ └─ Auxiliary risk task (volatility prediction)

├─→ Shaping & Normalization
│ ├─ Dense per-step rewards
│ ├─ Train-time shaping (hindsight bonus)
│ ├─ Reward scaling factor
│ └─ Log-return domain

└─→ Evaluation Protocol
  ├─ Walk-forward OOS + realistic costs
  ├─ Turnover / capacity metrics
  └─ Sensitivity to fee/slippage assumptions
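
The evaluation branch of the flowchart calls for strict walk-forward out-of-sample testing. A minimal sketch of the rolling-window split logic, with `walk_forward_splits` as a hypothetical helper name and window lengths chosen by the caller:

```python
def walk_forward_splits(n_obs, train_len, test_len, step=None):
    """Yield (train, test) index ranges for walk-forward evaluation.

    Each test window strictly follows its train window in time, so no
    future data leaks into fitting; windows roll forward by `step`
    observations (default: the test length, i.e. non-overlapping tests).
    """
    step = step or test_len
    start = 0
    while start + train_len + test_len <= n_obs:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += step
```

Each test window would be run with the full cost model (fees, spread, slippage) and the turnover/capacity metrics recorded per window, so that sensitivity to fee and slippage assumptions can be checked across regimes rather than on one aggregate backtest.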

Key Limitations

  • Live trading evidence is scarce — most validation is replay/backtesting with strong assumptions (zero slippage)
  • Sharpe optimization can be gamed — agents suppress volatility to hide tail risk
  • Execution rewards increase variance — optimizing mean IS alone raises execution-risk variability
  • Non-stationarity — any static reward function eventually decays
  • Regulatory risk — unconstrained agents may discover manipulative strategies (spoofing) to maximize reward
  • Overfitting — many models with high in-sample accuracy fail in live deployment