{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Bab 12 — Reinforcement Learning Lab\n",
        "Notebook ini menjalankan demo kecil dari `rl_playground.py`: bandit sales, SARSA/Q-learning, Mini Pong, policy gradient, soft pricing, dan toy RLHF."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from rl_playground import (\n",
        "    self_test, run_demo, sales_bandit_demo, gridworld_demo, mini_pong_demo,\n",
        "    policy_gradient_bandit_demo, soft_pricing_bandit_demo, toy_rlhf_preference_demo\n",
        ")\n",
        "self_test()\n",
        "print('self-test passed')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lab 1 — Sales bandit\n",
        "Bandingkan epsilon-greedy dan UCB untuk promo."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "sales_bandit_demo(seed=124)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lab 2 — Gridworld SARSA vs Q-learning"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "grid = gridworld_demo(seed=125)\n",
        "grid"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print('Policy Q-learning:')\n",
        "print('\\n'.join(grid['q_learning']['policy']))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lab 3 — Mini Pong"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "mini_pong_demo(seed=126)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lab 4 — Policy gradient bandit"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "policy_gradient_bandit_demo(seed=127)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lab 5 — Soft/SAC-style pricing"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "soft_pricing_bandit_demo(seed=128, temperature=0.7)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lab 6 — Toy RLHF preference model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "toy_rlhf_preference_demo(seed=129)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Semua hasil sekaligus"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "results = run_demo(seed=123)\n",
        "results.keys()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}