| 1 | PnL Delta + Turnover Penalty (Deng et al., IEEE TNNLS 2017) | r_t = a_{t-1}·Δp_t − δ·|a_t − a_{t-1}| | Dense per-step profit minus position-flip cost; a ∈ {−1, 0, 1} | Optional SR objective variant | Proportional via δ on position change | Futures (IF/AG/SU); SR objective recommended for reliability |
| 2 | Portfolio Value Delta + Scaling (FinRL StockTradingEnv) | r_t = (V_t − V_{t-1}) × reward_scaling | Per-step total asset change with explicit scaling for gradient stability | Turbulence threshold gating; forced liquidation | Proportional buy/sell cost rates | Gym-style env; widely adopted baseline |
| 3 | Log-Return + Remainder Factor (Jiang et al., 2017) | r_t = ln(V_t / V_{t-1}) with remainder factor μ | Log-domain compression; dense per-period | None in reward; Sharpe & MDD ex post | Proportional commission via μ; zero slippage assumed | Poloniex crypto; 0.25% commission; reports fAPV, Sharpe, MDD |
| 4 | Differential Sharpe Ratio (Moody & Saffell, NeurIPS 1998) | D_t = (B_{t-1}·ΔA_t − 0.5·A_{t-1}·ΔB_t) / (B_{t-1} − A_{t-1}²)^{3/2} | Online incremental approximation of the global Sharpe ratio, with A and B exponential moving estimates of the first and second moments of returns | Yes: Sharpe by construction | δ integrated into return | Monthly S&P 500 (1950–94); Sharpe > buy-and-hold at 0.5% cost |
| 5 | Negative Max Drawdown (DeepTrader, AAAI 2021) | r_risk = −MDD (risk-scaling module) | Hierarchical: return for asset scoring; −MDD for risk scaling | Yes: drawdown as first-class objective | Unspecified; shorting constraints | DJIA/HSI/CSI100; DT-MDD materially lower MDD than DT-RoR |
| 6 | Hindsight Bonus Shaping (DeepScalper, CIKM 2022) | r_train = r_base + λ·Σ r_{t+k} | Train-time shaping; eval uses base reward only | Auxiliary volatility prediction task | Fee rates per futures contract | 1-min OHLCV+LOB; 6 futures; best on TR%, Sharpe, Calmar |
| 7 | Implementation Shortfall (Hendricks & Wilcox, 2014) | IS = execution cost vs. arrival-price benchmark | Minimizes execution cost for optimal liquidation | Not in reward; variance increase noted | LOB depth traversal; Almgren–Chriss | South African equities, 5-min bars; ~10% IS improvement for short horizons |
| 8 | Sortino Ratio | SR = E[r−r_f] / σ_downside | Penalizes only downside deviation | Yes: asymmetric downside focus | Varies | Outperforms Sharpe in extended action spaces; +856.7% with 21 actions |
| 9 | Calmar Ratio | CR = Ann. Return / Max Drawdown | Step-wise normalization via log-return in Actor-Critic | Yes: drawdown denominator | Varies | A2C achieves highest Calmar (5.16) among PPO/DQN |
| 10 | PnL − Inventory Penalty (Market Making) | r = PnL − φ·q² | Balances spread capture vs inventory risk | Yes: quadratic inventory penalty | LOB execution; AIIF tuning | Simulated LOB; Sharpe 31.54 in 60 episodes |
| 11 | Self-Rewarding DRL (SRDRL, 2024) | r = max(r_expert, r_predicted) | Secondary network generates adaptive reward | Embedded in expert extraction | Varies | SRDDQN NASDAQ: 1124.23% vs 51.87% static baseline |
| 12 | Composite Multi-Objective | R = w₁·ret + w₂·entropy + w₃·corr + ... | Dynamic Lagrangian weighting | Multiple risk axes | L1 turnover + L2 slippage | Improves stability; natural regularization |
| 13 | Factor-Beta Alignment (FDRL) | r = α·Σ(β_p − β_bm)² weighted | Anchors agent to macro factor exposures | Factor deviation penalty | Varies | Equities: Sharpe 1.04→1.27; MDD ~18% |
| 14 | Regret Minimization | r = −(perf_expert − perf_agent) | Minimizes gap vs benchmark | Implicit via relative performance | Varies | Better OOS; more robust across regimes |
| 15 | ESG-Adjusted Ratios | Adjusted ΔSR/ΔSortino with ESG weighting | Non-financial ESG factors in risk-adjusted reward | Sharpe/Sortino + ESG | Varies | Socially responsible portfolio optimization |
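The simplest entry in the table, row 1's PnL delta with a turnover penalty, can be sketched as a one-line reward function. This is a minimal illustration, not the authors' implementation; the function name and the default δ value are assumptions.

```python
def pnl_turnover_reward(a_prev, a_curr, delta_p, delta=0.001):
    """Per-step reward r_t = a_{t-1} * Δp_t − δ * |a_t − a_{t-1}|.

    a_prev, a_curr: discrete positions in {-1, 0, 1}
    delta_p:        price change over the step (p_t − p_{t-1})
    delta:          proportional cost charged on position changes
    """
    # Profit is earned on the position held entering the step;
    # the penalty fires only when the agent flips or closes.
    return a_prev * delta_p - delta * abs(a_curr - a_prev)
```

Note the timing convention: PnL accrues to the position held before the step (a_{t-1}), so the agent cannot profit from the same price move it is reacting to, while a flip from long to short (|Δa| = 2) pays double the cost of opening from flat.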
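Row 4's differential Sharpe ratio is the most algorithmic entry: it turns the global Sharpe ratio into a dense per-step reward via exponential moving moment estimates. A minimal sketch, assuming the standard Moody & Saffell recursions A_t = A_{t-1} + η·ΔA_t and B_t = B_{t-1} + η·ΔB_t (the class name, η default, and zero-variance guard are choices made here, not from the paper):

```python
class DifferentialSharpe:
    """Online differential Sharpe ratio reward (after Moody & Saffell, 1998)."""

    def __init__(self, eta=0.01):
        self.eta = eta  # adaptation rate of the moving moment estimates
        self.A = 0.0    # EMA of returns (first moment)
        self.B = 0.0    # EMA of squared returns (second moment)

    def step(self, r):
        """Consume one period return r, emit the per-step reward D_t."""
        dA = r - self.A          # ΔA_t
        dB = r * r - self.B      # ΔB_t
        denom = (self.B - self.A ** 2) ** 1.5
        # Guard the first steps, where the variance estimate is ~0.
        if denom < 1e-12:
            D = 0.0
        else:
            D = (self.B * dA - 0.5 * self.A * dB) / denom
        # Update the moment estimates after computing D_t, so the
        # reward uses A_{t-1}, B_{t-1} as in the table's formula.
        self.A += self.eta * dA
        self.B += self.eta * dB
        return D
```

The update order matters: D_t must be computed from the previous moment estimates, which is what makes it an exact first-order expansion of the Sharpe ratio in η and therefore a valid dense reward signal.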