
on policy (1)

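[Off-Policy Learning] Concepts

Off-Policy Learning

On-policy ⇒ Exploitation
Learns fast, but may miss the best policy in the long run.
It keeps digging only near the spots that have already given good results.

Off-policy ⇒ Exploration
Learns slowly, but explores diverse actions to find the best policy.
It digs in many different places.

E[x^2] Following a Laplace Distribution
1. Central limit theorem: draw samples x that follow a Gaussian distribution.
2. Then give each value x a weight based on the y-value of the distribution we want (the red graph). Strictly speaking, the importance-sampling weight is that y-value divided by the Gaussian density at the same x.
3. The weighted samples then behave like samples that follow the red distribution.
This gives a comparatively stable and accurate policy.

2025. 1. 24.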
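The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the post's own code: the function names, the unit Laplace scale `b`, the proposal width `sigma = 3.0`, the sample count, and the seed are all assumptions chosen for the example. It draws samples from a Gaussian, weights each one by the ratio of the Laplace (target) density to the Gaussian (proposal) density, and averages the weighted x^2 values.

```python
import math
import random


def laplace_pdf(x, b=1.0):
    """Density of a zero-mean Laplace distribution with scale b (the target)."""
    return math.exp(-abs(x) / b) / (2.0 * b)


def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with std. dev. sigma (the proposal)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def estimate_second_moment(n_samples=200_000, sigma=3.0, b=1.0, seed=0):
    """Estimate E[x^2] under Laplace(0, b) using only Gaussian samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Step 1: sample from the Gaussian proposal.
        x = rng.gauss(0.0, sigma)
        # Step 2: weight = target density / proposal density at x.
        weight = laplace_pdf(x, b) / gaussian_pdf(x, sigma)
        # Step 3: accumulate the weighted value of x^2.
        total += weight * x * x
    return total / n_samples


estimate = estimate_second_moment()
exact = 2.0 * 1.0 ** 2  # closed form for a zero-mean Laplace: E[x^2] = 2 * b^2
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

One design note: a Gaussian has lighter tails than a Laplace distribution, so the weights grow in the far tails. Choosing a fairly wide proposal (here `sigma = 3.0`, an assumption) keeps the weights small over the range that is actually sampled. In off-policy learning the same idea appears with the behavior policy playing the role of the Gaussian and the target policy playing the role of the red distribution.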

