
[Reinforcement Learning] MADDPG

Last modified: June 5, 2024

Key points:

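Existing single-agent algorithms such as DQN, PG, and DDPG cannot properly model a partially observable Markov game.
The MDP formulation they rely on does not consider the other agents' actions, so from any one agent's point of view the environment looks non-stationary while the others keep learning.
MADDPG therefore introduces a centralized critic with decentralized actors: each actor acts on its own observation only, while the critic takes the other agents' actions into account.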
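To make the split concrete, here is a minimal NumPy sketch of the data flow only, not of the training loop. The linear "networks", the class names `Actor` and `CentralCritic`, and the sizes `N_AGENTS`, `OBS_DIM`, `ACT_DIM` are all assumptions made up for illustration; a real implementation would use neural networks, target networks, and a replay buffer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_AGENTS = 3
OBS_DIM = 4
ACT_DIM = 2


class Actor:
    """Decentralized actor: sees only its own agent's observation."""

    def __init__(self, obs_dim, act_dim):
        # A single linear layer stands in for the policy network.
        self.W = rng.normal(scale=0.1, size=(obs_dim, act_dim))

    def act(self, obs):
        # Deterministic continuous action in [-1, 1].
        return np.tanh(obs @ self.W)


class CentralCritic:
    """Centralized critic: scores the joint observation and joint action."""

    def __init__(self, n_agents, obs_dim, act_dim):
        in_dim = n_agents * (obs_dim + act_dim)
        # A single linear layer stands in for the Q network.
        self.w = rng.normal(scale=0.1, size=in_dim)

    def q_value(self, all_obs, all_acts):
        # Input is every agent's observation and action, concatenated.
        x = np.concatenate([np.ravel(all_obs), np.ravel(all_acts)])
        return float(x @ self.w)


# One actor and one critic per agent.
actors = [Actor(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]
critics = [CentralCritic(N_AGENTS, OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]

# Each actor maps its local observation to its own action.
observations = rng.normal(size=(N_AGENTS, OBS_DIM))
actions = np.stack([a.act(o) for a, o in zip(actors, observations)])

# Each critic evaluates the joint (observations, actions) pair.
q_values = [c.q_value(observations, actions) for c in critics]
```

The point of the sketch is the input signatures: `Actor.act` depends on one agent's observation, so it can run on its own at execution time, while `CentralCritic.q_value` changes whenever any agent's action changes. The critic is needed only during training.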

