School of Business, Sun Yat-sen University, Guangzhou 510275, Guangdong, China
MA Shuai (born 1987), male; research interests: Markov decision processes, decision making under risk; E-mail: mash35@mail.sysu.edu.cn
XIA Li (born 1980), male; research interests: stochastic learning and optimization, Markov decision processes, and reinforcement learning; E-mail: xiali5@mail.sysu.edu.cn
Print publication date: 2023-01-25
Online publication date: 2022-10-20
Received: 2022-03-09
Accepted: 2022-06-12
马帅,夏俐.风险敏感马氏决策过程与状态扩充变换[J].中山大学学报(自然科学版),2023,62(01):181-191. DOI: 10.13471/j.cnki.acta.snus.2022A020.
MA Shuai,XIA Li.Risk-sensitive Markov decision processes and state augmentation transformation[J].Acta Scientiarum Naturalium Universitatis Sunyatseni,2023,62(01):181-191. DOI: 10.13471/j.cnki.acta.snus.2022A020.
In a Markov decision process, the stochasticity of the process is determined by the policy and the transition kernel, and the randomness of the objective is further affected by the random reward and the randomized policy, where a random reward can often be simplified into a deterministic reward function. Under classical expectation-type criteria, such as the average or the discounted criterion, this reward simplification does not affect the optimization result. For risk-sensitive criteria, however, such a simplification changes the value of the risk objective and may therefore destroy the optimality of policies. To address this problem, the state augmentation transformation reorganizes the random information into an augmented state space, which simplifies the reward function while keeping the stochastic reward sequence unchanged. Taking three classical risk measures defined on the cumulative discounted reward as examples, this paper compares, in policy evaluation, the effects of the reward simplification and of the state augmentation transformation on risk estimation. Both the theoretical verification and the numerical experiments show that, when the reward function takes a complicated form, the state augmentation transformation simplifies the reward function while keeping the risk measures unchanged.
In the theory of Markov decision processes, the randomness of the objective stems not only from the stochasticity of the process but also from the randomness of the one-step reward and of the policy. When the optimality criterion concerns only the risk-neutral expectation of the objective, the reward (function) simplification does not affect the optimization result. However, the simplification changes the stochastic reward sequence, which in turn modifies a risk-sensitive objective, i.e., a risk measure. Since some theoretical methods require a reward function in a simple form while practical environments often come with a complicated one, we propose a technique termed the state augmentation transformation to bridge this gap: it preserves the stochastic reward sequence in a transformed process whose reward function takes a simple form. Taking three classical risk measures (variance, exponential utility, and conditional value at risk) as examples, the numerical experiments show that the state augmentation transformation keeps the risk measures intact, whereas the reward simplification fails to do so.
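To make the effect concrete, the following minimal Python sketch (not taken from the paper; all names and parameters are illustrative assumptions) simulates a toy reward process with a random ±1 one-step reward and its "simplified" counterpart in which the reward is replaced by its expectation, then estimates the mean and the three risk measures of the discounted return by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, n_paths = 0.9, 5, 200_000  # toy parameters, chosen only for illustration


def discounted_return(random_reward: bool) -> np.ndarray:
    """Simulate discounted returns of a toy reward process.

    With random_reward=True the one-step reward is +1 or -1 with equal
    probability (mean 0); with False it is replaced by its expectation 0,
    mimicking the reward (function) simplification."""
    returns = np.zeros(n_paths)
    for t in range(horizon):
        r = rng.choice([-1.0, 1.0], size=n_paths) if random_reward else np.zeros(n_paths)
        returns += (gamma ** t) * r
    return returns


def risk_measures(x: np.ndarray, beta: float = 0.5, alpha: float = 0.1) -> dict:
    """Mean, variance, exponential-utility certainty equivalent, and CVaR_alpha
    (mean of the worst alpha-fraction of outcomes) of a sample of returns."""
    q = np.quantile(x, alpha)  # lower-tail value at risk
    return {
        "mean": x.mean(),
        "variance": x.var(),
        "exp_utility": np.log(np.mean(np.exp(beta * x))) / beta,
        "cvar": x[x <= q].mean(),
    }


print("random reward     :", risk_measures(discounted_return(True)))
print("simplified reward :", risk_measures(discounted_return(False)))
```

In this toy example the means of the two processes agree, while the variance, exponential utility, and CVaR of the simplified process collapse to those of a deterministic return. This is the discrepancy the state augmentation transformation avoids: by recording the realized random reward in an augmented state, the transformed process has a reward function in a simple (deterministic) form but the same distribution of discounted returns as the original process.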
Markov decision process; state augmentation transformation; risk; reward simplification