{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "cba92afd",
      "metadata": {
        "id": "cba92afd"
      },
      "source": [
        "# MFoDL Lab — Retrieval-Augmented Generation (RAG)\n",
        "\n",
        "## Mathematical Foundations and Python Implementation\n",
        "\n",
        "This lab introduces **Retrieval-Augmented Generation (RAG)** from a mathematical and practical perspective.\n",
        "\n",
        "The goal is not to use a large external system, but to understand the internal logic of RAG:\n",
        "\n",
        "```text\n",
        "documents -> chunks -> vectors -> retrieval -> context -> answer\n",
        "```\n",
        "\n",
        "We will build a small RAG-like pipeline in Python using:\n",
        "\n",
        "- text chunking,\n",
        "- TF-IDF vectorization,\n",
        "- cosine similarity,\n",
        "- top-k retrieval,\n",
        "- simple answer generation from retrieved context,\n",
        "- evaluation of retrieval quality.\n",
        "\n",
        "The lab is designed to work without paid APIs and without internet access."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e5144562",
      "metadata": {
        "id": "e5144562"
      },
      "source": [
        "## Learning objectives\n",
        "\n",
        "After this lab, students should be able to:\n",
        "\n",
        "1. Explain what Retrieval-Augmented Generation is.\n",
        "2. Describe the difference between a language model's internal knowledge and external retrieved knowledge.\n",
        "3. Split documents into smaller text chunks.\n",
        "4. Represent chunks and queries as vectors.\n",
        "5. Retrieve relevant chunks using cosine similarity.\n",
        "6. Build a simple RAG pipeline.\n",
        "7. Understand why retrieval quality affects generated answers.\n",
        "8. Identify common limitations of RAG systems, such as hallucination, irrelevant context, and missing information.\n",
        "9. Design simple evaluation metrics for retrieval."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c3c39308",
      "metadata": {
        "id": "c3c39308"
      },
      "source": [
        "# 1. Motivation: why do we need RAG?\n",
        "\n",
        "Large language models can generate fluent text, but they may have limitations:\n",
        "\n",
        "1. They may not know private or local documents.\n",
        "2. Their training data may be outdated.\n",
        "3. They may produce hallucinations.\n",
        "4. They may not cite or ground answers in sources.\n",
        "5. They may be difficult to update without retraining.\n",
        "\n",
        "RAG addresses this by combining two components:\n",
        "\n",
        "```text\n",
        "retrieval + generation\n",
        "```\n",
        "\n",
        "Instead of asking a model to answer only from its internal parameters, we first retrieve relevant external information and provide it as context.\n",
        "\n",
        "A simplified RAG pipeline:\n",
        "\n",
        "```text\n",
        "User question\n",
        "    ↓\n",
        "Retrieve relevant document chunks\n",
        "    ↓\n",
        "Add retrieved chunks to prompt/context\n",
        "    ↓\n",
        "Generate grounded answer\n",
        "```\n",
        "\n",
        "In this lab, we will build the retrieval part carefully and simulate the generation part in a transparent way."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "138ffb21",
      "metadata": {
        "id": "138ffb21"
      },
      "source": [
        "# 2. A small document collection\n",
        "\n",
        "We start with a small artificial knowledge base about machine learning, NLP, transformers, and model evaluation.\n",
        "\n",
        "In real systems, the knowledge base may contain PDF files, technical documentation, legal documents, company reports, scientific papers, internal notes, product manuals, or support knowledge bases."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8dfe26f8",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8dfe26f8",
        "outputId": "27c4ebe0-cc74-49b5-c678-edd44466a284"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of documents: 8\n",
            "doc_1 - Bag of Words\n",
            "doc_2 - TF-IDF\n",
            "doc_3 - Cosine Similarity\n",
            "doc_4 - Word Embeddings\n",
            "doc_5 - Transformers\n",
            "doc_6 - Retrieval-Augmented Generation\n",
            "doc_7 - Model Evaluation\n",
            "doc_8 - Hallucination\n"
          ]
        }
      ],
      "source": [
        "# Cell 1 — Small document collection\n",
        "\n",
        "documents = [\n",
        "    {\n",
        "        \"doc_id\": \"doc_1\",\n",
        "        \"title\": \"Bag of Words\",\n",
        "        \"text\": (\n",
        "            \"Bag of Words is a simple text representation method. \"\n",
        "            \"It ignores word order and counts how many times each word appears in a document. \"\n",
        "            \"The result is a sparse vector with one dimension for each word in the vocabulary. \"\n",
        "            \"Bag of Words is useful as a baseline but does not capture semantic similarity.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_2\",\n",
        "        \"title\": \"TF-IDF\",\n",
        "        \"text\": (\n",
        "            \"TF-IDF is a weighting method for text representation. \"\n",
        "            \"It gives high weight to words that are frequent in a document but rare in the whole corpus. \"\n",
        "            \"TF-IDF stands for Term Frequency Inverse Document Frequency. \"\n",
        "            \"It is often used in search engines and classical NLP pipelines.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_3\",\n",
        "        \"title\": \"Cosine Similarity\",\n",
        "        \"text\": (\n",
        "            \"Cosine similarity measures the angle between two vectors. \"\n",
        "            \"It is commonly used to compare TF-IDF vectors or embedding vectors. \"\n",
        "            \"A value close to one means that vectors point in a similar direction. \"\n",
        "            \"A value close to zero means that vectors are not very similar.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_4\",\n",
        "        \"title\": \"Word Embeddings\",\n",
        "        \"text\": (\n",
        "            \"Word embeddings represent words as dense vectors. \"\n",
        "            \"Words used in similar contexts often receive similar vector representations. \"\n",
        "            \"Embeddings can capture semantic similarity better than Bag of Words. \"\n",
        "            \"Classical examples include Word2Vec, GloVe, and FastText.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_5\",\n",
        "        \"title\": \"Transformers\",\n",
        "        \"text\": (\n",
        "            \"Transformers are neural network architectures based on the attention mechanism. \"\n",
        "            \"Attention allows the model to focus on different tokens when building contextual representations. \"\n",
        "            \"Transformers changed NLP because they can model long-range dependencies more effectively than many older architectures.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_6\",\n",
        "        \"title\": \"Retrieval-Augmented Generation\",\n",
        "        \"text\": (\n",
        "            \"Retrieval-Augmented Generation, or RAG, combines information retrieval with text generation. \"\n",
        "            \"A retriever first finds relevant documents or chunks. \"\n",
        "            \"Then a generator uses the retrieved context to answer the user's question. \"\n",
        "            \"RAG can reduce hallucinations and allows models to use external knowledge.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_7\",\n",
        "        \"title\": \"Model Evaluation\",\n",
        "        \"text\": (\n",
        "            \"Machine learning models should be evaluated on unseen data. \"\n",
        "            \"For classification, common metrics include accuracy, precision, recall, and F1-score. \"\n",
        "            \"For retrieval systems, we can measure whether relevant documents appear in the top retrieved results. \"\n",
        "            \"Evaluation helps compare different model configurations.\"\n",
        "        )\n",
        "    },\n",
        "    {\n",
        "        \"doc_id\": \"doc_8\",\n",
        "        \"title\": \"Hallucination\",\n",
        "        \"text\": (\n",
        "            \"A hallucination occurs when a language model generates information that is not supported by the available evidence. \"\n",
        "            \"In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator. \"\n",
        "            \"Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\"\n",
        "        )\n",
        "    }\n",
        "]\n",
        "\n",
        "print(\"Number of documents:\", len(documents))\n",
        "for doc in documents:\n",
        "    print(doc[\"doc_id\"], \"-\", doc[\"title\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ac1a9bb6",
      "metadata": {
        "id": "ac1a9bb6"
      },
      "source": [
        "# 3. From documents to chunks\n",
        "\n",
        "In RAG systems, documents are usually split into smaller pieces called **chunks**.\n",
        "\n",
        "Why?\n",
        "\n",
        "Large documents may be too long to fit into a model context window. Also, retrieval works better when it can find a specific relevant fragment instead of a whole long document.\n",
        "\n",
        "A chunk may contain one paragraph, several sentences, a fixed number of tokens, or a sliding window over the document.\n",
        "\n",
        "Mathematically:\n",
        "\n",
        "```text\n",
        "Document D -> chunks c1, c2, ..., cn\n",
        "```\n",
        "\n",
        "Later, each chunk will be represented as a vector."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8ff4a878",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 990
        },
        "id": "8ff4a878",
        "outputId": "eb68b217-45ea-4089-df8e-ebf4f57761cc"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "         chunk_id doc_id                           title  \\\n",
              "0   doc_1_chunk_0  doc_1                    Bag of Words   \n",
              "1   doc_1_chunk_1  doc_1                    Bag of Words   \n",
              "2   doc_1_chunk_2  doc_1                    Bag of Words   \n",
              "3   doc_1_chunk_3  doc_1                    Bag of Words   \n",
              "4   doc_2_chunk_0  doc_2                          TF-IDF   \n",
              "5   doc_2_chunk_1  doc_2                          TF-IDF   \n",
              "6   doc_2_chunk_2  doc_2                          TF-IDF   \n",
              "7   doc_2_chunk_3  doc_2                          TF-IDF   \n",
              "8   doc_3_chunk_0  doc_3               Cosine Similarity   \n",
              "9   doc_3_chunk_1  doc_3               Cosine Similarity   \n",
              "10  doc_3_chunk_2  doc_3               Cosine Similarity   \n",
              "11  doc_3_chunk_3  doc_3               Cosine Similarity   \n",
              "12  doc_4_chunk_0  doc_4                 Word Embeddings   \n",
              "13  doc_4_chunk_1  doc_4                 Word Embeddings   \n",
              "14  doc_4_chunk_2  doc_4                 Word Embeddings   \n",
              "15  doc_4_chunk_3  doc_4                 Word Embeddings   \n",
              "16  doc_5_chunk_0  doc_5                    Transformers   \n",
              "17  doc_5_chunk_1  doc_5                    Transformers   \n",
              "18  doc_5_chunk_2  doc_5                    Transformers   \n",
              "19  doc_6_chunk_0  doc_6  Retrieval-Augmented Generation   \n",
              "20  doc_6_chunk_1  doc_6  Retrieval-Augmented Generation   \n",
              "21  doc_6_chunk_2  doc_6  Retrieval-Augmented Generation   \n",
              "22  doc_6_chunk_3  doc_6  Retrieval-Augmented Generation   \n",
              "23  doc_7_chunk_0  doc_7                Model Evaluation   \n",
              "24  doc_7_chunk_1  doc_7                Model Evaluation   \n",
              "25  doc_7_chunk_2  doc_7                Model Evaluation   \n",
              "26  doc_7_chunk_3  doc_7                Model Evaluation   \n",
              "27  doc_8_chunk_0  doc_8                   Hallucination   \n",
              "28  doc_8_chunk_1  doc_8                   Hallucination   \n",
              "29  doc_8_chunk_2  doc_8                   Hallucination   \n",
              "\n",
              "                                                 text  \n",
              "0   Bag of Words is a simple text representation m...  \n",
              "1   It ignores word order and counts how many time...  \n",
              "2   The result is a sparse vector with one dimensi...  \n",
              "3   Bag of Words is useful as a baseline but does ...  \n",
              "4   TF-IDF is a weighting method for text represen...  \n",
              "5   It gives high weight to words that are frequen...  \n",
              "6   TF-IDF stands for Term Frequency Inverse Docum...  \n",
              "7   It is often used in search engines and classic...  \n",
              "8   Cosine similarity measures the angle between t...  \n",
              "9   It is commonly used to compare TF-IDF vectors ...  \n",
              "10  A value close to one means that vectors point ...  \n",
              "11  A value close to zero means that vectors are n...  \n",
              "12  Word embeddings represent words as dense vectors.  \n",
              "13  Words used in similar contexts often receive s...  \n",
              "14  Embeddings can capture semantic similarity bet...  \n",
              "15  Classical examples include Word2Vec, GloVe, an...  \n",
              "16  Transformers are neural network architectures ...  \n",
              "17  Attention allows the model to focus on differe...  \n",
              "18  Transformers changed NLP because they can mode...  \n",
              "19  Retrieval-Augmented Generation, or RAG, combin...  \n",
              "20  A retriever first finds relevant documents or ...  \n",
              "21  Then a generator uses the retrieved context to...  \n",
              "22  RAG can reduce hallucinations and allows model...  \n",
              "23  Machine learning models should be evaluated on...  \n",
              "24  For classification, common metrics include acc...  \n",
              "25  For retrieval systems, we can measure whether ...  \n",
              "26  Evaluation helps compare different model confi...  \n",
              "27  A hallucination occurs when a language model g...  \n",
              "28  In RAG systems, hallucinations may still happe...  \n",
              "29  Grounding answers in retrieved sources can red...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-505ef2bb-d8a5-48e8-a4a5-cff7682644ee\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_1_chunk_0</td>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>Bag of Words is a simple text representation m...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_1_chunk_1</td>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>It ignores word order and counts how many time...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_1_chunk_2</td>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>The result is a sparse vector with one dimensi...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>doc_1_chunk_3</td>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>Bag of Words is useful as a baseline but does ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>doc_2_chunk_0</td>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>TF-IDF is a weighting method for text represen...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>doc_2_chunk_1</td>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>It gives high weight to words that are frequen...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>doc_2_chunk_2</td>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>TF-IDF stands for Term Frequency Inverse Docum...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>doc_2_chunk_3</td>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>It is often used in search engines and classic...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>doc_3_chunk_0</td>\n",
              "      <td>doc_3</td>\n",
              "      <td>Cosine Similarity</td>\n",
              "      <td>Cosine similarity measures the angle between t...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>doc_3_chunk_1</td>\n",
              "      <td>doc_3</td>\n",
              "      <td>Cosine Similarity</td>\n",
              "      <td>It is commonly used to compare TF-IDF vectors ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>doc_3_chunk_2</td>\n",
              "      <td>doc_3</td>\n",
              "      <td>Cosine Similarity</td>\n",
              "      <td>A value close to one means that vectors point ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>doc_3_chunk_3</td>\n",
              "      <td>doc_3</td>\n",
              "      <td>Cosine Similarity</td>\n",
              "      <td>A value close to zero means that vectors are n...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>doc_4_chunk_0</td>\n",
              "      <td>doc_4</td>\n",
              "      <td>Word Embeddings</td>\n",
              "      <td>Word embeddings represent words as dense vectors.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>doc_4_chunk_1</td>\n",
              "      <td>doc_4</td>\n",
              "      <td>Word Embeddings</td>\n",
              "      <td>Words used in similar contexts often receive s...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>doc_4_chunk_2</td>\n",
              "      <td>doc_4</td>\n",
              "      <td>Word Embeddings</td>\n",
              "      <td>Embeddings can capture semantic similarity bet...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>doc_4_chunk_3</td>\n",
              "      <td>doc_4</td>\n",
              "      <td>Word Embeddings</td>\n",
              "      <td>Classical examples include Word2Vec, GloVe, an...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>16</th>\n",
              "      <td>doc_5_chunk_0</td>\n",
              "      <td>doc_5</td>\n",
              "      <td>Transformers</td>\n",
              "      <td>Transformers are neural network architectures ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>doc_5_chunk_1</td>\n",
              "      <td>doc_5</td>\n",
              "      <td>Transformers</td>\n",
              "      <td>Attention allows the model to focus on differe...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18</th>\n",
              "      <td>doc_5_chunk_2</td>\n",
              "      <td>doc_5</td>\n",
              "      <td>Transformers</td>\n",
              "      <td>Transformers changed NLP because they can mode...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>doc_6_chunk_0</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>Retrieval-Augmented Generation, or RAG, combin...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>20</th>\n",
              "      <td>doc_6_chunk_1</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>A retriever first finds relevant documents or ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>21</th>\n",
              "      <td>doc_6_chunk_2</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>Then a generator uses the retrieved context to...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>22</th>\n",
              "      <td>doc_6_chunk_3</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>RAG can reduce hallucinations and allows model...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>23</th>\n",
              "      <td>doc_7_chunk_0</td>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>Machine learning models should be evaluated on...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>24</th>\n",
              "      <td>doc_7_chunk_1</td>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>For classification, common metrics include acc...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>25</th>\n",
              "      <td>doc_7_chunk_2</td>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>For retrieval systems, we can measure whether ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>26</th>\n",
              "      <td>doc_7_chunk_3</td>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>Evaluation helps compare different model confi...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>27</th>\n",
              "      <td>doc_8_chunk_0</td>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>A hallucination occurs when a language model g...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>28</th>\n",
              "      <td>doc_8_chunk_1</td>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>In RAG systems, hallucinations may still happe...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>29</th>\n",
              "      <td>doc_8_chunk_2</td>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>Grounding answers in retrieved sources can red...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-505ef2bb-d8a5-48e8-a4a5-cff7682644ee')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-505ef2bb-d8a5-48e8-a4a5-cff7682644ee button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-505ef2bb-d8a5-48e8-a4a5-cff7682644ee');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "  <div id=\"id_3deec75d-5147-4131-ad9d-c0ee1aaac7e9\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('chunks_df')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_3deec75d-5147-4131-ad9d-c0ee1aaac7e9 button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('chunks_df');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "chunks_df",
              "summary": "{\n  \"name\": \"chunks_df\",\n  \"rows\": 30,\n  \"fields\": [\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 30,\n        \"samples\": [\n          \"doc_8_chunk_0\",\n          \"doc_4_chunk_3\",\n          \"doc_7_chunk_0\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 8,\n        \"samples\": [\n          \"doc_2\",\n          \"doc_6\",\n          \"doc_1\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 8,\n        \"samples\": [\n          \"TF-IDF\",\n          \"Retrieval-Augmented Generation\",\n          \"Bag of Words\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 30,\n        \"samples\": [\n          \"A hallucination occurs when a language model generates information that is not supported by the available evidence.\",\n          \"Classical examples include Word2Vec, GloVe, and FastText.\",\n          \"Machine learning models should be evaluated on unseen data.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 2
        }
      ],
      "source": [
        "# Cell 2 — Simple sentence-based chunking\n",
        "\n",
        "import re\n",
        "import pandas as pd\n",
        "\n",
        "\n",
        "def split_into_sentences(text):\n",
        "    \"\"\"\n",
        "    Very simple sentence splitter.\n",
        "    It splits text at '.', '?', or '!'.\n",
        "    This is not a production-quality tokenizer.\n",
        "    \"\"\"\n",
        "    sentences = re.split(r\"(?<=[.!?])\\s+\", text.strip())\n",
        "    sentences = [s.strip() for s in sentences if s.strip()]\n",
        "    return sentences\n",
        "\n",
        "\n",
        "def create_sentence_chunks(documents):\n",
        "    \"\"\"\n",
        "    Create one chunk per sentence.\n",
        "    Each chunk keeps information about the source document.\n",
        "    \"\"\"\n",
        "    chunks = []\n",
        "\n",
        "    for doc in documents:\n",
        "        sentences = split_into_sentences(doc[\"text\"])\n",
        "\n",
        "        for i, sentence in enumerate(sentences):\n",
        "            chunks.append({\n",
        "                \"chunk_id\": f\"{doc['doc_id']}_chunk_{i}\",\n",
        "                \"doc_id\": doc[\"doc_id\"],\n",
        "                \"title\": doc[\"title\"],\n",
        "                \"text\": sentence\n",
        "            })\n",
        "\n",
        "    return chunks\n",
        "\n",
        "\n",
        "chunks = create_sentence_chunks(documents)\n",
        "chunks_df = pd.DataFrame(chunks)\n",
        "chunks_df"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5a00a857",
      "metadata": {
        "id": "5a00a857"
      },
      "source": [
        "## Discussion\n",
        "\n",
        "Sentence-based chunking is simple and interpretable, but it has limitations.\n",
        "\n",
        "A sentence may be too short and lose context. A paragraph may be too long and contain multiple topics.\n",
        "\n",
        "In real RAG systems, chunking strategy is one of the most important design choices.\n",
        "\n",
        "Poor chunking can cause poor retrieval, even if the vector model is good."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5a185fcd",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 175
        },
        "id": "5a185fcd",
        "outputId": "80793282-be80-4b64-b219-58622a778106"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "         chunk_id doc_id                           title  \\\n",
              "19  doc_6_chunk_0  doc_6  Retrieval-Augmented Generation   \n",
              "20  doc_6_chunk_1  doc_6  Retrieval-Augmented Generation   \n",
              "21  doc_6_chunk_2  doc_6  Retrieval-Augmented Generation   \n",
              "22  doc_6_chunk_3  doc_6  Retrieval-Augmented Generation   \n",
              "\n",
              "                                                 text  \n",
              "19  Retrieval-Augmented Generation, or RAG, combin...  \n",
              "20  A retriever first finds relevant documents or ...  \n",
              "21  Then a generator uses the retrieved context to...  \n",
              "22  RAG can reduce hallucinations and allows model...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-5181ab8f-63a4-4452-ac11-229db0e1d5ca\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>doc_6_chunk_0</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>Retrieval-Augmented Generation, or RAG, combin...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>20</th>\n",
              "      <td>doc_6_chunk_1</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>A retriever first finds relevant documents or ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>21</th>\n",
              "      <td>doc_6_chunk_2</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>Then a generator uses the retrieved context to...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>22</th>\n",
              "      <td>doc_6_chunk_3</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>RAG can reduce hallucinations and allows model...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5181ab8f-63a4-4452-ac11-229db0e1d5ca')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-5181ab8f-63a4-4452-ac11-229db0e1d5ca button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-5181ab8f-63a4-4452-ac11-229db0e1d5ca');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"chunks_df[chunks_df[\\\"doc_id\\\"] == selected_doc_id]\",\n  \"rows\": 4,\n  \"fields\": [\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 4,\n        \"samples\": [\n          \"doc_6_chunk_1\",\n          \"doc_6_chunk_3\",\n          \"doc_6_chunk_0\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          \"doc_6\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          \"Retrieval-Augmented Generation\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 4,\n        \"samples\": [\n          \"A retriever first finds relevant documents or chunks.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 3
        }
      ],
      "source": [
        "# Cell 3 — TODO: Inspect chunks\n",
        "#\n",
        "# TODO:\n",
        "# 1. Display all chunks from one selected document.\n",
        "# 2. Change `selected_doc_id`.\n",
        "# 3. Check whether sentence-based chunks are informative enough.\n",
        "\n",
        "selected_doc_id = \"doc_6\"\n",
        "\n",
        "chunks_df[chunks_df[\"doc_id\"] == selected_doc_id]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9994b78f",
      "metadata": {
        "id": "9994b78f"
      },
      "source": [
        "# 4. Vector representation of chunks\n",
        "\n",
        "To retrieve relevant chunks, we need to represent both chunks and user questions as vectors in the same vector space.\n",
        "\n",
        "In this lab, we use **TF-IDF** as a simple vector representation.\n",
        "\n",
        "The pipeline is:\n",
        "\n",
        "```text\n",
        "chunks -> TF-IDF vectorizer -> chunk vectors\n",
        "query  -> same vectorizer -> query vector\n",
        "```\n",
        "\n",
        "Then we compare query vector with chunk vectors using cosine similarity."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "184d40f6",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "184d40f6",
        "outputId": "8d6618c7-9be6-4ad2-c5f1-b2ef2e0e5c05"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Number of chunks: 30\n",
            "Number of TF-IDF features: 149\n",
            "\n",
            "Example vocabulary terms:\n",
            "['accuracy' 'allows' 'angle' 'answer' 'answers' 'appear' 'appears'\n",
            " 'architectures' 'attention' 'augmented' 'available' 'bag' 'based'\n",
            " 'baseline' 'better' 'building' 'capture' 'changed' 'chunks' 'classical'\n",
            " 'classification' 'close' 'combines' 'common' 'commonly' 'compare'\n",
            " 'completely' 'configurations' 'context' 'contexts']\n"
          ]
        }
      ],
      "source": [
        "# Cell 4 — Vectorize chunks using TF-IDF\n",
        "\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "\n",
        "chunk_texts = chunks_df[\"text\"].tolist()\n",
        "\n",
        "vectorizer = TfidfVectorizer(\n",
        "    lowercase=True,\n",
        "    stop_words=\"english\"\n",
        ")\n",
        "\n",
        "chunk_vectors = vectorizer.fit_transform(chunk_texts)\n",
        "\n",
        "print(\"Number of chunks:\", chunk_vectors.shape[0])\n",
        "print(\"Number of TF-IDF features:\", chunk_vectors.shape[1])\n",
        "\n",
        "print(\"\\nExample vocabulary terms:\")\n",
        "print(vectorizer.get_feature_names_out()[:30])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "20c34b51",
      "metadata": {
        "id": "20c34b51"
      },
      "source": [
        "# 5. Retrieval using cosine similarity\n",
        "\n",
        "For a query `q` and chunk vectors `c1, c2, ..., cn`, we compute:\n",
        "\n",
        "```text\n",
        "similarity(q, ci) = cosine(q, ci)\n",
        "```\n",
        "\n",
        "Then we rank chunks by similarity and return the top `k`.\n",
        "\n",
        "This is the core retrieval step in RAG:\n",
        "\n",
        "```text\n",
        "query -> vector -> cosine similarity -> top-k chunks\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5f50be03",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 293
        },
        "id": "5f50be03",
        "outputId": "60c04e96-f4be-4fcb-d91f-f19030b1a0b6"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "        chunk_id doc_id                           title  \\\n",
              "0  doc_6_chunk_3  doc_6  Retrieval-Augmented Generation   \n",
              "1  doc_8_chunk_2  doc_8                   Hallucination   \n",
              "2  doc_8_chunk_1  doc_8                   Hallucination   \n",
              "3  doc_1_chunk_3  doc_1                    Bag of Words   \n",
              "4  doc_6_chunk_0  doc_6  Retrieval-Augmented Generation   \n",
              "\n",
              "                                                text  similarity  \n",
              "0  RAG can reduce hallucinations and allows model...    0.463610  \n",
              "1  Grounding answers in retrieved sources can red...    0.311373  \n",
              "2  In RAG systems, hallucinations may still happe...    0.261038  \n",
              "3  Bag of Words is useful as a baseline but does ...    0.224403  \n",
              "4  Retrieval-Augmented Generation, or RAG, combin...    0.111730  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-bff6230c-a04a-42a3-96ec-8fa4a9dbadaf\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>text</th>\n",
              "      <th>similarity</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_6_chunk_3</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>RAG can reduce hallucinations and allows model...</td>\n",
              "      <td>0.463610</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_8_chunk_2</td>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>Grounding answers in retrieved sources can red...</td>\n",
              "      <td>0.311373</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_8_chunk_1</td>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>In RAG systems, hallucinations may still happe...</td>\n",
              "      <td>0.261038</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>doc_1_chunk_3</td>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>Bag of Words is useful as a baseline but does ...</td>\n",
              "      <td>0.224403</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>doc_6_chunk_0</td>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>Retrieval-Augmented Generation, or RAG, combin...</td>\n",
              "      <td>0.111730</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-bff6230c-a04a-42a3-96ec-8fa4a9dbadaf')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-bff6230c-a04a-42a3-96ec-8fa4a9dbadaf button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-bff6230c-a04a-42a3-96ec-8fa4a9dbadaf');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "  <div id=\"id_2b5dcc64-bb5b-46a0-84a8-d373ea10b0c9\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('retrieved')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_2b5dcc64-bb5b-46a0-84a8-d373ea10b0c9 button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('retrieved');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "retrieved",
              "summary": "{\n  \"name\": \"retrieved\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"doc_8_chunk_2\",\n          \"doc_6_chunk_0\",\n          \"doc_8_chunk_1\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_6\",\n          \"doc_8\",\n          \"doc_1\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Retrieval-Augmented Generation\",\n          \"Hallucination\",\n          \"Bag of Words\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\",\n          \"Retrieval-Augmented Generation, or RAG, combines information retrieval with text generation.\",\n          \"In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.12875097222709006,\n        \"min\": 0.1117302127092086,\n        \"max\": 0.46361029986060764,\n        \"num_unique_values\": 5,\n        \"samples\": [\n          
0.31137321653390393,\n          0.1117302127092086,\n          0.26103799288786533\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {},
          "execution_count": 5
        }
      ],
      "source": [
        "# Cell 5 — Retrieval function\n",
        "\n",
        "from sklearn.metrics.pairwise import cosine_similarity\n",
        "import numpy as np\n",
        "\n",
        "\n",
        "def retrieve(query, vectorizer, chunk_vectors, chunks_df, top_k=3):\n",
        "    \"\"\"\n",
        "    Retrieve top_k chunks most similar to the query.\n",
        "    \"\"\"\n",
        "    query_vector = vectorizer.transform([query])\n",
        "    similarities = cosine_similarity(query_vector, chunk_vectors)[0]\n",
        "    top_indices = np.argsort(similarities)[::-1][:top_k]\n",
        "\n",
        "    results = chunks_df.iloc[top_indices].copy()\n",
        "    results[\"similarity\"] = similarities[top_indices]\n",
        "\n",
        "    return results.reset_index(drop=True)\n",
        "\n",
        "\n",
        "query = \"How does RAG reduce hallucinations?\"\n",
        "\n",
        "retrieved = retrieve(\n",
        "    query=query,\n",
        "    vectorizer=vectorizer,\n",
        "    chunk_vectors=chunk_vectors,\n",
        "    chunks_df=chunks_df,\n",
        "    top_k=5\n",
        ")\n",
        "\n",
        "retrieved"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0c0e1a51",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0c0e1a51",
        "outputId": "6e699f4f-36df-46ca-b1bd-ee2e10809886"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Query:\n",
            "How does RAG reduce hallucinations?\n",
            "\n",
            "Retrieved chunks:\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 1\n",
            "Source: doc_6 — Retrieval-Augmented Generation\n",
            "Similarity: 0.464\n",
            "Text: RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 2\n",
            "Source: doc_8 — Hallucination\n",
            "Similarity: 0.311\n",
            "Text: Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 3\n",
            "Source: doc_8 — Hallucination\n",
            "Similarity: 0.261\n",
            "Text: In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 4\n",
            "Source: doc_1 — Bag of Words\n",
            "Similarity: 0.224\n",
            "Text: Bag of Words is useful as a baseline but does not capture semantic similarity.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 5\n",
            "Source: doc_6 — Retrieval-Augmented Generation\n",
            "Similarity: 0.112\n",
            "Text: Retrieval-Augmented Generation, or RAG, combines information retrieval with text generation.\n",
            "--------------------------------------------------------------------------------\n"
          ]
        }
      ],
      "source": [
        "# Cell 6 — Display retrieved context in a readable way\n",
        "\n",
        "\n",
        "def print_retrieved_results(query, results):\n",
        "    print(\"Query:\")\n",
        "    print(query)\n",
        "    print(\"\\nRetrieved chunks:\")\n",
        "    print(\"-\" * 80)\n",
        "\n",
        "    for i, row in results.iterrows():\n",
        "        print(f\"Rank {i + 1}\")\n",
        "        print(f\"Source: {row['doc_id']} — {row['title']}\")\n",
        "        print(f\"Similarity: {row['similarity']:.3f}\")\n",
        "        print(f\"Text: {row['text']}\")\n",
        "        print(\"-\" * 80)\n",
        "\n",
        "\n",
        "print_retrieved_results(query, retrieved)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "274c5d25",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "274c5d25",
        "outputId": "386835cb-2ac3-4d21-9eea-7ef75083d5ef"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Query:\n",
            "What is TF-IDF?\n",
            "\n",
            "Retrieved chunks:\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 1\n",
            "Source: doc_2 — TF-IDF\n",
            "Similarity: 0.538\n",
            "Text: TF-IDF is a weighting method for text representation.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 2\n",
            "Source: doc_3 — Cosine Similarity\n",
            "Similarity: 0.442\n",
            "Text: It is commonly used to compare TF-IDF vectors or embedding vectors.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 3\n",
            "Source: doc_2 — TF-IDF\n",
            "Similarity: 0.384\n",
            "Text: TF-IDF stands for Term Frequency Inverse Document Frequency.\n",
            "--------------------------------------------------------------------------------\n",
            "\n",
            "\n",
            "\n",
            "Query:\n",
            "How do transformers use attention?\n",
            "\n",
            "Retrieved chunks:\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 1\n",
            "Source: doc_5 — Transformers\n",
            "Similarity: 0.391\n",
            "Text: Transformers are neural network architectures based on the attention mechanism.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 2\n",
            "Source: doc_6 — Retrieval-Augmented Generation\n",
            "Similarity: 0.240\n",
            "Text: RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 3\n",
            "Source: doc_5 — Transformers\n",
            "Similarity: 0.177\n",
            "Text: Attention allows the model to focus on different tokens when building contextual representations.\n",
            "--------------------------------------------------------------------------------\n",
            "\n",
            "\n",
            "\n",
            "Query:\n",
            "Why can language models hallucinate?\n",
            "\n",
            "Retrieved chunks:\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 1\n",
            "Source: doc_8 — Hallucination\n",
            "Similarity: 0.258\n",
            "Text: A hallucination occurs when a language model generates information that is not supported by the available evidence.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 2\n",
            "Source: doc_7 — Model Evaluation\n",
            "Similarity: 0.246\n",
            "Text: Machine learning models should be evaluated on unseen data.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 3\n",
            "Source: doc_6 — Retrieval-Augmented Generation\n",
            "Similarity: 0.229\n",
            "Text: RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "--------------------------------------------------------------------------------\n",
            "\n",
            "\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Cell 7 — TODO: Test retrieval with your own queries\n",
        "#\n",
        "# TODO:\n",
        "# 1. Create at least three questions.\n",
        "# 2. Retrieve top 3 chunks for each question.\n",
        "# 3. Decide whether the retrieved chunks are relevant.\n",
        "\n",
        "my_queries = [\n",
        "    \"What is TF-IDF?\",\n",
        "    \"How do transformers use attention?\",\n",
        "    \"Why can language models hallucinate?\"\n",
        "]\n",
        "\n",
        "for q in my_queries:\n",
        "    results = retrieve(q, vectorizer, chunk_vectors, chunks_df, top_k=3)\n",
        "    print_retrieved_results(q, results)\n",
        "    print(\"\\n\\n\")"
      ]
    },
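    {
      "cell_type": "markdown",
      "id": "b7d41e90",
      "metadata": {
        "id": "b7d41e90"
      },
      "source": [
        "The outputs above show that top-k retrieval always returns k chunks, even when some of them share only a few incidental words with the query. One possible refinement is to drop results whose similarity falls below a cutoff. The sketch below is an illustration, not part of the lab code: the name `filter_by_similarity`, the `min_similarity` parameter, and the cutoff values are assumptions, and a suitable cutoff depends on the corpus.\n",
        "\n",
        "```python\n",
        "# Hypothetical helper (not part of the lab code).\n",
        "def filter_by_similarity(results, min_similarity=0.2):\n",
        "    \"\"\"Keep only results whose similarity is at least min_similarity.\"\"\"\n",
        "    return [r for r in results if r[\"similarity\"] >= min_similarity]\n",
        "\n",
        "\n",
        "# Scores copied from the \"Why can language models hallucinate?\" output above.\n",
        "example_results = [\n",
        "    {\"doc_id\": \"doc_8\", \"similarity\": 0.258},\n",
        "    {\"doc_id\": \"doc_7\", \"similarity\": 0.246},\n",
        "    {\"doc_id\": \"doc_6\", \"similarity\": 0.229},\n",
        "]\n",
        "\n",
        "# The cutoff 0.25 is an assumed value chosen only for this example.\n",
        "kept = filter_by_similarity(example_results, min_similarity=0.25)\n",
        "print([r[\"doc_id\"] for r in kept])\n",
        "```\n",
        "\n",
        "With the `retrieved` DataFrame from Cell 5, the same idea can be written as a boolean mask: `retrieved[retrieved[\"similarity\"] >= 0.2]`. A cutoff cannot repair ranking errors, though. For the transformers query above, the relevant attention chunk scored 0.177, below a less relevant chunk at 0.240."
      ]
    },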
    {
      "cell_type": "markdown",
      "id": "652604c9",
      "metadata": {
        "id": "652604c9"
      },
      "source": [
        "# 6. Building a simple RAG pipeline\n",
        "\n",
        "A full RAG pipeline has two main stages:\n",
        "\n",
        "```text\n",
        "1. Retrieve relevant context\n",
        "2. Generate an answer using that context\n",
        "```\n",
        "\n",
        "In production systems, the generation step is usually performed by a language model.\n",
        "\n",
        "In this lab, to keep everything transparent and reproducible, we will create a simple extractive answer generator.\n",
        "\n",
        "It will not truly generate new language. Instead, it will:\n",
        "\n",
        "1. retrieve relevant chunks,\n",
        "2. combine them into a context,\n",
        "3. return a short answer based only on retrieved text.\n",
        "\n",
        "This helps us understand the grounding idea behind RAG."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ec73f552",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 288
        },
        "id": "ec73f552",
        "outputId": "0b463aa2-4189-46f9-9cb5-19effd2929a1"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Question:\n",
            "What is Retrieval-Augmented Generation?\n",
            "\n",
            "Answer based on retrieved context:\n",
            "Retrieval-Augmented Generation, or RAG, combines information retrieval with text generation. For retrieval systems, we can measure whether relevant documents appear in the top retrieved results. In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id                           title       chunk_id  similarity\n",
              "0  doc_6  Retrieval-Augmented Generation  doc_6_chunk_0    0.816646\n",
              "1  doc_7                Model Evaluation  doc_7_chunk_2    0.183025\n",
              "2  doc_8                   Hallucination  doc_8_chunk_1    0.000000"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-e051dc55-13b6-4e58-97c8-5f480c7331bd\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>similarity</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>doc_6_chunk_0</td>\n",
              "      <td>0.816646</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>doc_7_chunk_2</td>\n",
              "      <td>0.183025</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>doc_8_chunk_1</td>\n",
              "      <td>0.000000</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e051dc55-13b6-4e58-97c8-5f480c7331bd')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-e051dc55-13b6-4e58-97c8-5f480c7331bd button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-e051dc55-13b6-4e58-97c8-5f480c7331bd');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "  <div id=\"id_32bf3fd8-95a9-4ace-b179-fc1c3b44faf8\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('sources')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_32bf3fd8-95a9-4ace-b179-fc1c3b44faf8 button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('sources');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "sources",
              "summary": "{\n  \"name\": \"sources\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_6\",\n          \"doc_7\",\n          \"doc_8\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Retrieval-Augmented Generation\",\n          \"Model Evaluation\",\n          \"Hallucination\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_6_chunk_0\",\n          \"doc_7_chunk_2\",\n          \"doc_8_chunk_1\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.42854125834384216,\n        \"min\": 0.0,\n        \"max\": 0.816646471895088,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.816646471895088,\n          0.18302513283801927,\n          0.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        }
      ],
      "source": [
        "# Cell 8 — Simple extractive RAG answer\n",
        "\n",
        "\n",
        "def simple_rag_answer(query, top_k=3):\n",
        "    \"\"\"\n",
        "    A simple RAG-like pipeline:\n",
        "    1. retrieve top-k chunks,\n",
        "    2. create context,\n",
        "    3. return a short answer based only on retrieved chunks.\n",
        "    \"\"\"\n",
        "    results = retrieve(\n",
        "        query=query,\n",
        "        vectorizer=vectorizer,\n",
        "        chunk_vectors=chunk_vectors,\n",
        "        chunks_df=chunks_df,\n",
        "        top_k=top_k\n",
        "    )\n",
        "\n",
        "    context_sentences = results[\"text\"].tolist()\n",
        "    answer = \" \".join(context_sentences)\n",
        "    sources = results[[\"doc_id\", \"title\", \"chunk_id\", \"similarity\"]]\n",
        "\n",
        "    return answer, sources, results\n",
        "\n",
        "\n",
        "query = \"What is Retrieval-Augmented Generation?\"\n",
        "answer, sources, results = simple_rag_answer(query, top_k=3)\n",
        "\n",
        "print(\"Question:\")\n",
        "print(query)\n",
        "\n",
        "print(\"\\nAnswer based on retrieved context:\")\n",
        "print(answer)\n",
        "\n",
        "print(\"\\nSources:\")\n",
        "display(sources)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cbe425c0",
      "metadata": {
        "id": "cbe425c0"
      },
      "source": [
        "## Discussion\n",
        "\n",
        "This answer is not elegant, but it is grounded.\n",
        "\n",
        "It uses only information from retrieved chunks.\n",
        "\n",
        "In a real system, the retrieved chunks would be inserted into a prompt such as:\n",
        "\n",
        "```text\n",
        "Use only the following context to answer the question.\n",
        "\n",
        "Context:\n",
        "...\n",
        "\n",
        "Question:\n",
        "...\n",
        "\n",
        "Answer:\n",
        "```\n",
        "\n",
        "Then a language model would generate a natural-language answer.\n",
        "\n",
        "The quality of the final answer strongly depends on retrieval quality."
      ]
    },
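    {
      "cell_type": "markdown",
      "id": "d2a85c17",
      "metadata": {
        "id": "d2a85c17"
      },
      "source": [
        "The prompt template above can be filled in programmatically. The sketch below shows one possible way to do it. `build_prompt` is a hypothetical helper, not part of the lab code, and the numbered `[1]`, `[2]` source labels are an assumed convention. No language model is called; the code only builds the prompt string.\n",
        "\n",
        "```python\n",
        "# Hypothetical helper (not part of the lab code).\n",
        "def build_prompt(question, chunks):\n",
        "    \"\"\"Fill the grounding prompt template with retrieved chunk texts.\"\"\"\n",
        "    # Number each chunk so a generated answer could point back to its source.\n",
        "    context = \"\\n\".join(f\"[{i}] {text}\" for i, text in enumerate(chunks, start=1))\n",
        "    return (\n",
        "        \"Use only the following context to answer the question.\\n\\n\"\n",
        "        f\"Context:\\n{context}\\n\\n\"\n",
        "        f\"Question:\\n{question}\\n\\n\"\n",
        "        \"Answer:\"\n",
        "    )\n",
        "\n",
        "\n",
        "# Two chunk texts taken from the retrieval outputs earlier in this lab.\n",
        "chunks = [\n",
        "    \"RAG can reduce hallucinations and allows models to use external knowledge.\",\n",
        "    \"Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\",\n",
        "]\n",
        "\n",
        "prompt = build_prompt(\"How does RAG reduce hallucinations?\", chunks)\n",
        "print(prompt)\n",
        "```\n",
        "\n",
        "With the lab's own objects, the chunk list could come from `results[\"text\"].tolist()`, as in Cell 8."
      ]
    },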
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "06c58f0d",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 791
        },
        "id": "06c58f0d",
        "outputId": "080b7eeb-7890-48ba-f960-08aa24423bd0"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "====================================================================================================\n",
            "Question: Why is cosine similarity useful in NLP?\n",
            "\n",
            "Answer:\n",
            "Cosine similarity measures the angle between two vectors. Bag of Words is useful as a baseline but does not capture semantic similarity. It is often used in search engines and classical NLP pipelines.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id              title       chunk_id  similarity\n",
              "0  doc_3  Cosine Similarity  doc_3_chunk_0    0.438476\n",
              "1  doc_1       Bag of Words  doc_1_chunk_3    0.354738\n",
              "2  doc_2             TF-IDF  doc_2_chunk_3    0.186492"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-44df8181-c2d0-4c2e-886a-36db3aba5349\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>similarity</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_3</td>\n",
              "      <td>Cosine Similarity</td>\n",
              "      <td>doc_3_chunk_0</td>\n",
              "      <td>0.438476</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>doc_1_chunk_3</td>\n",
              "      <td>0.354738</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>doc_2_chunk_3</td>\n",
              "      <td>0.186492</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-44df8181-c2d0-4c2e-886a-36db3aba5349')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-44df8181-c2d0-4c2e-886a-36db3aba5349 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-44df8181-c2d0-4c2e-886a-36db3aba5349');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "  <div id=\"id_116656f1-62bf-4a9d-bba4-f12d234ac6fe\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('sources')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_116656f1-62bf-4a9d-bba4-f12d234ac6fe button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('sources');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "sources",
              "summary": "{\n  \"name\": \"sources\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_3\",\n          \"doc_1\",\n          \"doc_2\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Cosine Similarity\",\n          \"Bag of Words\",\n          \"TF-IDF\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_3_chunk_0\",\n          \"doc_1_chunk_3\",\n          \"doc_2_chunk_3\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.12833193685122185,\n        \"min\": 0.1864923859359534,\n        \"max\": 0.4384761399947755,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.4384761399947755,\n          0.35473824284626165,\n          0.1864923859359534\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "====================================================================================================\n",
            "Question: What are word embeddings?\n",
            "\n",
            "Answer:\n",
            "Word embeddings represent words as dense vectors. It ignores word order and counts how many times each word appears in a document. Embeddings can capture semantic similarity better than Bag of Words.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id            title       chunk_id  similarity\n",
              "0  doc_4  Word Embeddings  doc_4_chunk_0    0.575785\n",
              "1  doc_1     Bag of Words  doc_1_chunk_1    0.381083\n",
              "2  doc_4  Word Embeddings  doc_4_chunk_2    0.289908"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-12a91085-1ecf-412e-9f5c-01d06ebb32a1\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>similarity</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_4</td>\n",
              "      <td>Word Embeddings</td>\n",
              "      <td>doc_4_chunk_0</td>\n",
              "      <td>0.575785</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_1</td>\n",
              "      <td>Bag of Words</td>\n",
              "      <td>doc_1_chunk_1</td>\n",
              "      <td>0.381083</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_4</td>\n",
              "      <td>Word Embeddings</td>\n",
              "      <td>doc_4_chunk_2</td>\n",
              "      <td>0.289908</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-12a91085-1ecf-412e-9f5c-01d06ebb32a1')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-12a91085-1ecf-412e-9f5c-01d06ebb32a1 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-12a91085-1ecf-412e-9f5c-01d06ebb32a1');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "  <div id=\"id_8ed18e08-99b0-496c-9433-11038216a36f\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('sources')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_8ed18e08-99b0-496c-9433-11038216a36f button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('sources');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "sources",
              "summary": "{\n  \"name\": \"sources\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"doc_1\",\n          \"doc_4\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"Bag of Words\",\n          \"Word Embeddings\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_4_chunk_0\",\n          \"doc_1_chunk_1\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.14602964251220996,\n        \"min\": 0.28990776218457187,\n        \"max\": 0.5757852949117614,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.5757852949117614,\n          0.3810826735130893\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "====================================================================================================\n",
            "Question: How should machine learning models be evaluated?\n",
            "\n",
            "Answer:\n",
            "Machine learning models should be evaluated on unseen data. RAG can reduce hallucinations and allows models to use external knowledge. A hallucination occurs when a language model generates information that is not supported by the available evidence.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id                           title       chunk_id  similarity\n",
              "0  doc_7                Model Evaluation  doc_7_chunk_0    0.809242\n",
              "1  doc_6  Retrieval-Augmented Generation  doc_6_chunk_3    0.157510\n",
              "2  doc_8                   Hallucination  doc_8_chunk_0    0.000000"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-41b9ebb4-2e3b-4797-83e8-f132b4aead6d\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>similarity</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>doc_7_chunk_0</td>\n",
              "      <td>0.809242</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>doc_6_chunk_3</td>\n",
              "      <td>0.157510</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>doc_8_chunk_0</td>\n",
              "      <td>0.000000</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-41b9ebb4-2e3b-4797-83e8-f132b4aead6d')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-41b9ebb4-2e3b-4797-83e8-f132b4aead6d button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-41b9ebb4-2e3b-4797-83e8-f132b4aead6d');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "  <div id=\"id_b5b2906c-7604-43a3-8ea1-e38db4742ad1\">\n",
              "    <style>\n",
              "      .colab-df-generate {\n",
              "        background-color: #E8F0FE;\n",
              "        border: none;\n",
              "        border-radius: 50%;\n",
              "        cursor: pointer;\n",
              "        display: none;\n",
              "        fill: #1967D2;\n",
              "        height: 32px;\n",
              "        padding: 0 0 0 0;\n",
              "        width: 32px;\n",
              "      }\n",
              "\n",
              "      .colab-df-generate:hover {\n",
              "        background-color: #E2EBFA;\n",
              "        box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "        fill: #174EA6;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate {\n",
              "        background-color: #3B4455;\n",
              "        fill: #D2E3FC;\n",
              "      }\n",
              "\n",
              "      [theme=dark] .colab-df-generate:hover {\n",
              "        background-color: #434B5C;\n",
              "        box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "        filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "        fill: #FFFFFF;\n",
              "      }\n",
              "    </style>\n",
              "    <button class=\"colab-df-generate\" onclick=\"generateWithVariable('sources')\"\n",
              "            title=\"Generate code using this dataframe.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "    <script>\n",
              "      (() => {\n",
              "      const buttonEl =\n",
              "        document.querySelector('#id_b5b2906c-7604-43a3-8ea1-e38db4742ad1 button.colab-df-generate');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      buttonEl.onclick = () => {\n",
              "        google.colab.notebook.generateWithVariable('sources');\n",
              "      }\n",
              "      })();\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "variable_name": "sources",
              "summary": "{\n  \"name\": \"sources\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_7\",\n          \"doc_6\",\n          \"doc_8\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Model Evaluation\",\n          \"Retrieval-Augmented Generation\",\n          \"Hallucination\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_7_chunk_0\",\n          \"doc_6_chunk_3\",\n          \"doc_8_chunk_0\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.4290372469226054,\n        \"min\": 0.0,\n        \"max\": 0.8092423238176495,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.8092423238176495,\n          0.15750980290757624,\n          0.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        }
      ],
      "source": [
        "# Cell 9 — TODO: Create answers for several questions\n",
        "#\n",
        "# TODO:\n",
        "# 1. Create three questions about the knowledge base.\n",
        "# 2. Use `simple_rag_answer`.\n",
        "# 3. Inspect the retrieved sources.\n",
        "# 4. Decide whether the answer is well grounded.\n",
        "\n",
        "questions = [\n",
        "    \"Why is cosine similarity useful in NLP?\",\n",
        "    \"What are word embeddings?\",\n",
        "    \"How should machine learning models be evaluated?\"\n",
        "]\n",
        "\n",
        "for q in questions:\n",
        "    answer, sources, results = simple_rag_answer(q, top_k=3)\n",
        "\n",
        "    print(\"=\" * 100)\n",
        "    print(\"Question:\", q)\n",
        "    print(\"\\nAnswer:\")\n",
        "    print(answer)\n",
        "    print(\"\\nSources:\")\n",
        "    display(sources)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6ed80243",
      "metadata": {
        "id": "6ed80243"
      },
      "source": [
        "# 7. Prompt construction for RAG\n",
        "\n",
        "In real RAG systems, the retrieved chunks are inserted into a prompt.\n",
        "\n",
        "The prompt typically contains:\n",
        "\n",
        "1. system instruction,\n",
        "2. retrieved context,\n",
        "3. user question,\n",
        "4. constraints for the answer.\n",
        "\n",
        "For example:\n",
        "\n",
        "```text\n",
        "You are an assistant answering questions using only the provided context.\n",
        "If the context does not contain the answer, say that the answer is not available.\n",
        "\n",
        "Context:\n",
        "[chunk 1]\n",
        "[chunk 2]\n",
        "[chunk 3]\n",
        "\n",
        "Question:\n",
        "...\n",
        "\n",
        "Answer:\n",
        "```\n",
        "\n",
        "This is important because the generator should be encouraged to use retrieved evidence instead of inventing unsupported information."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "bfc0745c",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bfc0745c",
        "outputId": "e201d269-e82a-4e8a-c75e-755932e7c0b9"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "You are an assistant answering questions using only the provided context.\n",
            "If the context does not contain the answer, say: \"The provided context does not contain enough information.\"\n",
            "\n",
            "Context:\n",
            "[Source 1: doc_6 — Retrieval-Augmented Generation]\n",
            "RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "\n",
            "[Source 2: doc_8 — Hallucination]\n",
            "Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\n",
            "\n",
            "[Source 3: doc_8 — Hallucination]\n",
            "In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "\n",
            "Question:\n",
            "How does RAG reduce hallucinations?\n",
            "\n",
            "Answer:\n"
          ]
        }
      ],
      "source": [
        "# Cell 10 — Build a RAG prompt\n",
        "\n",
        "\n",
        "def build_rag_prompt(query, retrieved_results):\n",
        "    context_parts = []\n",
        "\n",
        "    # Number sources by position rather than by DataFrame index label.\n",
        "    for i, (_, row) in enumerate(retrieved_results.iterrows()):\n",
        "        context_parts.append(\n",
        "            f\"[Source {i + 1}: {row['doc_id']} — {row['title']}]\\n{row['text']}\"\n",
        "        )\n",
        "\n",
        "    context = \"\\n\\n\".join(context_parts)\n",
        "\n",
        "    prompt = f\"\"\"\n",
        "You are an assistant answering questions using only the provided context.\n",
        "If the context does not contain the answer, say: \"The provided context does not contain enough information.\"\n",
        "\n",
        "Context:\n",
        "{context}\n",
        "\n",
        "Question:\n",
        "{query}\n",
        "\n",
        "Answer:\n",
        "\"\"\"\n",
        "    return prompt.strip()\n",
        "\n",
        "\n",
        "query = \"How does RAG reduce hallucinations?\"\n",
        "retrieved = retrieve(query, vectorizer, chunk_vectors, chunks_df, top_k=3)\n",
        "\n",
        "prompt = build_rag_prompt(query, retrieved)\n",
        "\n",
        "print(prompt)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4a6c937d",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4a6c937d",
        "outputId": "ab24caba-921e-40d3-910b-c969b4193d52"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "You are an assistant answering questions using only the provided context.\n",
            "Answer in no more than three sentences.\n",
            "Cite source IDs, for example [doc_6].\n",
            "Do not use information outside the context.\n",
            "If the answer is not in the context, say: \"The provided context does not contain enough information.\"\n",
            "\n",
            "Context:\n",
            "[Source 1: doc_6 — Retrieval-Augmented Generation]\n",
            "RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "\n",
            "[Source 2: doc_8 — Hallucination]\n",
            "Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\n",
            "\n",
            "[Source 3: doc_8 — Hallucination]\n",
            "In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "\n",
            "Question:\n",
            "How does RAG reduce hallucinations?\n",
            "\n",
            "Answer:\n"
          ]
        }
      ],
      "source": [
        "# Cell 11 — TODO: Modify the prompt\n",
        "#\n",
        "# TODO:\n",
        "# Modify the prompt-building function so that it additionally asks the model to:\n",
        "# 1. answer in no more than three sentences,\n",
        "# 2. cite source IDs,\n",
        "# 3. avoid using information outside the context.\n",
        "\n",
        "\n",
        "def build_custom_rag_prompt(query, retrieved_results):\n",
        "    context_parts = []\n",
        "\n",
        "    # Number sources by position rather than by DataFrame index label.\n",
        "    for i, (_, row) in enumerate(retrieved_results.iterrows()):\n",
        "        context_parts.append(\n",
        "            f\"[Source {i + 1}: {row['doc_id']} — {row['title']}]\\n{row['text']}\"\n",
        "        )\n",
        "\n",
        "    context = \"\\n\\n\".join(context_parts)\n",
        "\n",
        "    prompt = f\"\"\"\n",
        "You are an assistant answering questions using only the provided context.\n",
        "Answer in no more than three sentences.\n",
        "Cite source IDs, for example [doc_6].\n",
        "Do not use information outside the context.\n",
        "If the answer is not in the context, say: \"The provided context does not contain enough information.\"\n",
        "\n",
        "Context:\n",
        "{context}\n",
        "\n",
        "Question:\n",
        "{query}\n",
        "\n",
        "Answer:\n",
        "\"\"\"\n",
        "    return prompt.strip()\n",
        "\n",
        "\n",
        "custom_prompt = build_custom_rag_prompt(query, retrieved)\n",
        "print(custom_prompt)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3603186e",
      "metadata": {
        "id": "3603186e"
      },
      "source": [
        "# 8. Hugging Face generator inside a RAG pipeline\n",
        "\n",
        "So far, our RAG pipeline has used a very simple extractive answer generator: it joins the retrieved chunks and returns them as the answer.\n",
        "\n",
        "Now we add an optional Hugging Face text-generation model.\n",
        "\n",
        "The updated pipeline is:\n",
        "\n",
        "```text\n",
        "question\n",
        "-> retrieve top-k chunks\n",
        "-> build prompt with context\n",
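        "-> Hugging Face text-generation model\n",
        "-> grounded answer\n",
        "```\n",
        "\n",
        "The code in this section loads `gpt2` through the `text-generation` pipeline. GPT-2 is a small causal language model without instruction tuning, so it tends to continue the prompt rather than follow its instructions. A small instruction-following model such as `google/flan-t5-small` or `google/flan-t5-base` is an alternative worth trying.\n",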
        "\n",
        "Important practical note:\n",
        "\n",
        "- this section requires internet access the first time the model is downloaded,\n",
        "- it may be slower on CPU,\n",
        "- if the model is unavailable, the rest of the lab still works.\n",
        "\n",
        "The goal is not to build the strongest possible RAG system, but to see how a generator can use retrieved context."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2963e1fe",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 212,
          "referenced_widgets": [
            "4bad55bd31334bd78bc1ce356a5bda0e",
            "44e84d091f1144f4b370949ea8a34ada",
            "315b1318f15e4ea58a55374e721c9769",
            "8f2e0cb009854b658968b4f9e0e93f70",
            "4dd91cff9b084c99b78c4e9f5ebb7022",
            "4f4a33c72f424d32ac6cecb8bd2300fe",
            "ad72f9fb43854265b213f127c5060cdf",
            "331976f66e4e4741bd676550853c1bc6",
            "d6d5db77d26a4d5d89d83cd34f2fa16e",
            "40391c81c47b4819badf919245fec20e",
            "bbc635adecef43bb9dd2376d95592712"
          ]
        },
        "id": "2963e1fe",
        "outputId": "fb0c00f7-29f5-4b6e-bd31-9c81f1a99a5b"
      },
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "Loading weights:   0%|          | 0/148 [00:00<?, ?it/s]"
            ],
            "application/vnd.jupyter.widget-view+json": {
              "version_major": 2,
              "version_minor": 0,
              "model_id": "4bad55bd31334bd78bc1ce356a5bda0e"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "GPT2LMHeadModel LOAD REPORT from: gpt2\n",
            "Key                  | Status     |  | \n",
            "---------------------+------------+--+-\n",
            "h.{0...11}.attn.bias | UNEXPECTED |  | \n",
            "\n",
            "Notes:\n",
            "- UNEXPECTED\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Hugging Face generator loaded successfully.\n"
          ]
        }
      ],
      "source": [
        "import warnings\n",
        "warnings.filterwarnings('ignore')\n",
        "\n",
        "# Cell HF-1 — Optional Hugging Face text-generation model (GPT-2)\n",
        "#\n",
        "# In Colab or a fresh environment, you may need:\n",
        "# !pip install transformers sentencepiece accelerate\n",
        "#\n",
        "# We use a try/except block so the notebook can still run if the model\n",
        "# cannot be downloaded in the current environment.\n",
        "\n",
        "try:\n",
        "    from transformers import pipeline\n",
        "\n",
        "    hf_generator = pipeline(\n",
        "        task=\"text-generation\",\n",
        "        model=\"gpt2\",\n",
        "        max_new_tokens=120\n",
        "    )\n",
        "\n",
        "    print(\"Hugging Face generator loaded successfully.\")\n",
        "\n",
        "except Exception as e:\n",
        "    hf_generator = None\n",
        "    print(\"Hugging Face generator is not available in this environment.\")\n",
        "    print(\"You can still use the extractive RAG pipeline from previous cells.\")\n",
        "    print(\"Error:\", e)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "185e8c0c",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "185e8c0c",
        "outputId": "7d0e6bbc-1e85-4f17-d398-246394648b34"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Answer the question using only the context below.\n",
            "If the context does not contain the answer, say: I do not know based on the context.\n",
            "\n",
            "Context:\n",
            "Source 1 (doc_6, Retrieval-Augmented Generation): RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "Source 2 (doc_8, Hallucination): Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\n",
            "Source 3 (doc_8, Hallucination): In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "\n",
            "Question: How does RAG reduce hallucinations?\n",
            "\n",
            "Answer:\n"
          ]
        }
      ],
      "source": [
        "# Cell HF-2 — Build a compact prompt for the Hugging Face generator\n",
        "#\n",
        "# Small local models have limited context windows and weaker instruction following\n",
        "# than large commercial LLMs. Therefore, the prompt should be short and explicit.\n",
        "\n",
        "def build_hf_rag_prompt(query, retrieved_results):\n",
        "    context_parts = []\n",
        "\n",
        "    # Number sources by position rather than by DataFrame index label.\n",
        "    for i, (_, row) in enumerate(retrieved_results.iterrows()):\n",
        "        context_parts.append(\n",
        "            f\"Source {i + 1} ({row['doc_id']}, {row['title']}): {row['text']}\"\n",
        "        )\n",
        "\n",
        "    context = \"\\n\".join(context_parts)\n",
        "\n",
        "    prompt = f\"\"\"\n",
        "Answer the question using only the context below.\n",
        "If the context does not contain the answer, say: I do not know based on the context.\n",
        "\n",
        "Context:\n",
        "{context}\n",
        "\n",
        "Question: {query}\n",
        "\n",
        "Answer:\n",
        "\"\"\"\n",
        "    return prompt.strip()\n",
        "\n",
        "\n",
        "query = \"How does RAG reduce hallucinations?\"\n",
        "retrieved = retrieve(query, vectorizer, chunk_vectors, chunks_df, top_k=3)\n",
        "\n",
        "hf_prompt = build_hf_rag_prompt(query, retrieved)\n",
        "\n",
        "print(hf_prompt)"
      ]
    },
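    {
      "cell_type": "markdown",
      "id": "c41f7a90",
      "metadata": {
        "id": "c41f7a90"
      },
      "source": [
        "The comment in Cell HF-2 notes that small local models have limited context windows. One simple way to respect such a limit is to keep retrieved chunks, in rank order, until a word budget is used up. The cell below is a minimal sketch of that idea, not part of the lab's pipeline.\n",
        "\n",
        "Assumptions: the helper name `select_chunks_within_budget` and the `max_words` parameter are hypothetical, and the whitespace word count is only a rough stand-in for the model's real tokenizer.\n",
        "\n",
        "In the notebook, the same helper could be applied to `retrieved[\"text\"].tolist()` before the prompt is built."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9b2e6d15",
      "metadata": {
        "id": "9b2e6d15"
      },
      "outputs": [],
      "source": [
        "# Sketch: keep retrieved chunks within a word budget (illustrative helper)\n",
        "\n",
        "\n",
        "def select_chunks_within_budget(chunks, max_words=60):\n",
        "    \"\"\"Keep chunks in rank order until the word budget is exhausted.\"\"\"\n",
        "    selected = []\n",
        "    used = 0\n",
        "    for text in chunks:\n",
        "        # Whitespace word count is a rough stand-in for real tokenization.\n",
        "        n_words = len(text.split())\n",
        "        if used + n_words > max_words:\n",
        "            break  # stop at the first chunk that does not fit\n",
        "        selected.append(text)\n",
        "        used += n_words\n",
        "    return selected\n",
        "\n",
        "\n",
        "example_chunks = [\n",
        "    \"RAG can reduce hallucinations and allows models to use external knowledge.\",\n",
        "    \"Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\",\n",
        "    \"In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\",\n",
        "]\n",
        "\n",
        "kept = select_chunks_within_budget(example_chunks, max_words=25)\n",
        "print(f\"Kept {len(kept)} of {len(example_chunks)} chunks\")"
      ]
    },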
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "43d7e0c3",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 769
        },
        "id": "43d7e0c3",
        "outputId": "641e3b7c-767f-4669-b21b-6eac842fa486"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n",
            "Both `max_new_tokens` (=120) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Question:\n",
            "How does RAG reduce hallucinations?\n",
            "\n",
            "Generated answer:\n",
            "Answer the question using only the context below.\n",
            "If the context does not contain the answer, say: I do not know based on the context.\n",
            "\n",
            "Context:\n",
            "Source 1 (doc_6, Retrieval-Augmented Generation): RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "Source 2 (doc_8, Hallucination): Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\n",
            "Source 3 (doc_8, Hallucination): In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "\n",
            "Question: How does RAG reduce hallucinations?\n",
            "\n",
            "Answer: RAG generates multiple reports for each hallucination according to the context it contains. The most common method is the use of an information retrieval system (or ANS) as described above. An information retrieval system is a system of retrieval and retrieval for the information contained in a given document. An ANS is a data retrieval system which uses a set of information stored on the computer as a base to retrieve information from as many sources as possible.\n",
            "\n",
            "Example:\n",
            "\n",
            "Ans: [1]\n",
            "\n",
            "[2]\n",
            "\n",
            "[3]\n",
            "\n",
            "[4]\n",
            "\n",
            "[5]\n",
            "\n",
            "Retrieved sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id                           title       chunk_id  similarity  \\\n",
              "0  doc_6  Retrieval-Augmented Generation  doc_6_chunk_3    0.463610   \n",
              "1  doc_8                   Hallucination  doc_8_chunk_2    0.311373   \n",
              "2  doc_8                   Hallucination  doc_8_chunk_1    0.261038   \n",
              "\n",
              "                                                text  \n",
              "0  RAG can reduce hallucinations and allows model...  \n",
              "1  Grounding answers in retrieved sources can red...  \n",
              "2  In RAG systems, hallucinations may still happe...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-b442a10b-18c5-483d-b920-bbbfe911e13a\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>chunk_id</th>\n",
              "      <th>similarity</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>doc_6_chunk_3</td>\n",
              "      <td>0.463610</td>\n",
              "      <td>RAG can reduce hallucinations and allows model...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>doc_8_chunk_2</td>\n",
              "      <td>0.311373</td>\n",
              "      <td>Grounding answers in retrieved sources can red...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>doc_8_chunk_1</td>\n",
              "      <td>0.261038</td>\n",
              "      <td>In RAG systems, hallucinations may still happe...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"    display(sources)\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"doc_8\",\n          \"doc_6\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"Hallucination\",\n          \"Retrieval-Augmented Generation\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"chunk_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_6_chunk_3\",\n          \"doc_8_chunk_2\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.10547140515449151,\n        \"min\": 0.26103799288786533,\n        \"max\": 0.46361029986060764,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.46361029986060764,\n          0.31137321653390393\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"RAG can reduce hallucinations and allows models to use external knowledge.\",\n          \"Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        }
      ],
      "source": [
        "# Cell HF-3 — Generate an answer with the Hugging Face model\n",
        "\n",
        "if hf_generator is not None:\n",
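        "    # The text-generation pipeline returns the prompt followed by the\n",
        "    # continuation, so the printed text starts with the prompt itself.\n",
        "    output = hf_generator(hf_prompt)\n",
        "    generated_answer = output[0][\"generated_text\"]\n",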
        "\n",
        "    print(\"Question:\")\n",
        "    print(query)\n",
        "\n",
        "    print(\"\\nGenerated answer:\")\n",
        "    print(generated_answer)\n",
        "\n",
        "    print(\"\\nRetrieved sources:\")\n",
        "    display(retrieved[[\"doc_id\", \"title\", \"chunk_id\", \"similarity\", \"text\"]])\n",
        "else:\n",
        "    print(\"Hugging Face generator is not available. Using extractive fallback instead.\")\n",
        "    answer, sources, results = simple_rag_answer(query, top_k=3)\n",
        "    print(answer)\n",
        "    display(sources)"
      ]
    },
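    {
      "cell_type": "markdown",
      "id": "e07a3c58",
      "metadata": {
        "id": "e07a3c58"
      },
      "source": [
        "The recorded output of Cell HF-3 begins with the prompt itself, because the generated text contains the prompt followed by the model's continuation. The cell below is a minimal sketch of separating the two, assuming the returned text starts with the exact prompt string.\n",
        "\n",
        "Assumptions: the helper name `strip_prompt_echo` is hypothetical and is not part of the lab or of `transformers`; the demo strings are made up for illustration.\n",
        "\n",
        "In the notebook, the helper could be called as `strip_prompt_echo(hf_prompt, output[0][\"generated_text\"])`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5d8f12b4",
      "metadata": {
        "id": "5d8f12b4"
      },
      "outputs": [],
      "source": [
        "# Sketch: separate the model's continuation from the echoed prompt\n",
        "\n",
        "\n",
        "def strip_prompt_echo(prompt, generated_text):\n",
        "    \"\"\"Return only the text that follows the prompt, if the prompt is echoed.\"\"\"\n",
        "    if generated_text.startswith(prompt):\n",
        "        return generated_text[len(prompt):].strip()\n",
        "    # Fall back to the full text when no echo is detected.\n",
        "    return generated_text.strip()\n",
        "\n",
        "\n",
        "demo_prompt = \"Question: What is RAG?\\n\\nAnswer:\"\n",
        "demo_output = demo_prompt + \" RAG combines retrieval with generation.\"\n",
        "\n",
        "print(strip_prompt_echo(demo_prompt, demo_output))"
      ]
    },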
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8e01d8f4",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 646
        },
        "id": "8e01d8f4",
        "outputId": "f8ac1418-682d-435f-e484-42305fc619fb"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n",
            "Both `max_new_tokens` (=120) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Question:\n",
            "What is Retrieval-Augmented Generation?\n",
            "\n",
            "Answer:\n",
            "Answer the question using only the context below.\n",
            "If the context does not contain the answer, say: I do not know based on the context.\n",
            "\n",
            "Context:\n",
            "Source 1 (doc_6, Retrieval-Augmented Generation): Retrieval-Augmented Generation, or RAG, combines information retrieval with text generation.\n",
            "Source 2 (doc_7, Model Evaluation): For retrieval systems, we can measure whether relevant documents appear in the top retrieved results.\n",
            "Source 3 (doc_8, Hallucination): In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "\n",
            "Question: What is Retrieval-Augmented Generation?\n",
            "\n",
            "Answer: Retrieval-Augmented Generation is a process that takes part in the task of retrieving documents from a database.\n",
            "\n",
            "In RAG systems, an RAG task involves processing the information that is stored in a database. The information is stored in the database, and the process proceeds as if the documents were retrieved from the database.\n",
            "\n",
            "Once the process has completed, the process returns the information to the database (which is a JSON object). The process has been run for five minutes, and finally, the process returns the document to the database.\n",
            "\n",
            "In the example below, the process\n",
            "\n",
            "Generation mode: hugging face generator\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id                           title  similarity  \\\n",
              "0  doc_6  Retrieval-Augmented Generation    0.816646   \n",
              "1  doc_7                Model Evaluation    0.183025   \n",
              "2  doc_8                   Hallucination    0.000000   \n",
              "\n",
              "                                                text  \n",
              "0  Retrieval-Augmented Generation, or RAG, combin...  \n",
              "1  For retrieval systems, we can measure whether ...  \n",
              "2  In RAG systems, hallucinations may still happe...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-395d9137-9662-4008-be01-eae81cc55506\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>similarity</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>0.816646</td>\n",
              "      <td>Retrieval-Augmented Generation, or RAG, combin...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>0.183025</td>\n",
              "      <td>For retrieval systems, we can measure whether ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>In RAG systems, hallucinations may still happe...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-395d9137-9662-4008-be01-eae81cc55506')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-395d9137-9662-4008-be01-eae81cc55506 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-395d9137-9662-4008-be01-eae81cc55506');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"display(result[\\\"sources\\\"][[\\\"doc_id\\\", \\\"title\\\", \\\"similarity\\\", \\\"text\\\"]])\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_6\",\n          \"doc_7\",\n          \"doc_8\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Retrieval-Augmented Generation\",\n          \"Model Evaluation\",\n          \"Hallucination\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.42854125834384216,\n        \"min\": 0.0,\n        \"max\": 0.816646471895088,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.816646471895088,\n          0.18302513283801927,\n          0.0\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Retrieval-Augmented Generation, or RAG, combines information retrieval with text generation.\",\n          \"For retrieval systems, we can measure whether relevant documents appear in the top retrieved results.\",\n          \"In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        }
      ],
      "source": [
        "# Cell HF-4 — RAG function with Hugging Face generation and fallback\n",
        "\n",
        "\n",
        "def rag_answer_with_hf(query, top_k=3, threshold=0.15):\n",
        "    \"\"\"\n",
        "    RAG pipeline with optional Hugging Face generation.\n",
        "\n",
        "    Steps:\n",
        "    1. Retrieve top-k chunks.\n",
        "    2. Check similarity threshold.\n",
        "    3. Build prompt.\n",
        "    4. Generate answer using Hugging Face model if available.\n",
        "    5. Otherwise use extractive fallback.\n",
        "    \"\"\"\n",
        "    retrieved_results = retrieve(\n",
        "        query=query,\n",
        "        vectorizer=vectorizer,\n",
        "        chunk_vectors=chunk_vectors,\n",
        "        chunks_df=chunks_df,\n",
        "        top_k=top_k\n",
        "    )\n",
        "\n",
        "    best_similarity = retrieved_results[\"similarity\"].iloc[0]\n",
        "\n",
        "    if best_similarity < threshold:\n",
        "        return {\n",
        "            \"question\": query,\n",
        "            \"answer\": \"The provided knowledge base does not contain enough information to answer this question.\",\n",
        "            \"sources\": retrieved_results,\n",
        "            \"used_generator\": False,\n",
        "            \"reason\": \"best similarity below threshold\"\n",
        "        }\n",
        "\n",
        "    prompt = build_hf_rag_prompt(query, retrieved_results)\n",
        "\n",
        "    if hf_generator is not None:\n",
        "        output = hf_generator(prompt)\n",
        "        answer = output[0][\"generated_text\"]\n",
        "        used_generator = True\n",
        "        reason = \"hugging face generator\"\n",
        "    else:\n",
        "        answer = \" \".join(retrieved_results[\"text\"].tolist())\n",
        "        used_generator = False\n",
        "        reason = \"extractive fallback\"\n",
        "\n",
        "    return {\n",
        "        \"question\": query,\n",
        "        \"answer\": answer,\n",
        "        \"sources\": retrieved_results,\n",
        "        \"used_generator\": used_generator,\n",
        "        \"reason\": reason\n",
        "    }\n",
        "\n",
        "\n",
        "result = rag_answer_with_hf(\n",
        "    \"What is Retrieval-Augmented Generation?\",\n",
        "    top_k=3,\n",
        "    threshold=0.15\n",
        ")\n",
        "\n",
        "print(\"Question:\")\n",
        "print(result[\"question\"])\n",
        "\n",
        "print(\"\\nAnswer:\")\n",
        "print(result[\"answer\"])\n",
        "\n",
        "print(\"\\nGeneration mode:\", result[\"reason\"])\n",
        "\n",
        "print(\"\\nSources:\")\n",
        "display(result[\"sources\"][[\"doc_id\", \"title\", \"similarity\", \"text\"]])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4e320862",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 917
        },
        "id": "4e320862",
        "outputId": "ce2713b0-0b9e-4117-8451-e79f48da58f6"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "====================================================================================================\n",
            "Question: What is TF-IDF?\n",
            "\n",
            "Extractive answer:\n",
            "TF-IDF is a weighting method for text representation. It is commonly used to compare TF-IDF vectors or embedding vectors. TF-IDF stands for Term Frequency Inverse Document Frequency.\n",
            "\n",
            "HF RAG answer:\n",
            "TF-IDF is a weighting method for text representation. It is commonly used to compare TF-IDF vectors or embedding vectors. TF-IDF stands for Term Frequency Inverse Document Frequency.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id              title  similarity  \\\n",
              "0  doc_2             TF-IDF    0.538304   \n",
              "1  doc_3  Cosine Similarity    0.442431   \n",
              "2  doc_2             TF-IDF    0.384244   \n",
              "\n",
              "                                                text  \n",
              "0  TF-IDF is a weighting method for text represen...  \n",
              "1  It is commonly used to compare TF-IDF vectors ...  \n",
              "2  TF-IDF stands for Term Frequency Inverse Docum...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-c698bbc9-4e0c-445d-89a0-5ee9963d4947\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>similarity</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>0.538304</td>\n",
              "      <td>TF-IDF is a weighting method for text represen...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_3</td>\n",
              "      <td>Cosine Similarity</td>\n",
              "      <td>0.442431</td>\n",
              "      <td>It is commonly used to compare TF-IDF vectors ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_2</td>\n",
              "      <td>TF-IDF</td>\n",
              "      <td>0.384244</td>\n",
              "      <td>TF-IDF stands for Term Frequency Inverse Docum...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c698bbc9-4e0c-445d-89a0-5ee9963d4947')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-c698bbc9-4e0c-445d-89a0-5ee9963d4947 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-c698bbc9-4e0c-445d-89a0-5ee9963d4947');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"    display(hf_result[\\\"sources\\\"][[\\\"doc_id\\\", \\\"title\\\", \\\"similarity\\\", \\\"text\\\"]])\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"doc_3\",\n          \"doc_2\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"Cosine Similarity\",\n          \"TF-IDF\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.07779466023355308,\n        \"min\": 0.38424397493352536,\n        \"max\": 0.5383044158709056,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.5383044158709056,\n          0.442431082968308\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"TF-IDF is a weighting method for text representation.\",\n          \"It is commonly used to compare TF-IDF vectors or embedding vectors.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "====================================================================================================\n",
            "Question: How do transformers use attention?\n",
            "\n",
            "Extractive answer:\n",
            "Transformers are neural network architectures based on the attention mechanism. RAG can reduce hallucinations and allows models to use external knowledge. Attention allows the model to focus on different tokens when building contextual representations.\n",
            "\n",
            "HF RAG answer:\n",
            "Transformers are neural network architectures based on the attention mechanism. RAG can reduce hallucinations and allows models to use external knowledge. Attention allows the model to focus on different tokens when building contextual representations.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id                           title  similarity  \\\n",
              "0  doc_5                    Transformers    0.390984   \n",
              "1  doc_6  Retrieval-Augmented Generation    0.239837   \n",
              "2  doc_5                    Transformers    0.177442   \n",
              "\n",
              "                                                text  \n",
              "0  Transformers are neural network architectures ...  \n",
              "1  RAG can reduce hallucinations and allows model...  \n",
              "2  Attention allows the model to focus on differe...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-2cda9e3d-8654-4c9b-92f5-daffe437f486\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>similarity</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_5</td>\n",
              "      <td>Transformers</td>\n",
              "      <td>0.390984</td>\n",
              "      <td>Transformers are neural network architectures ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>0.239837</td>\n",
              "      <td>RAG can reduce hallucinations and allows model...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_5</td>\n",
              "      <td>Transformers</td>\n",
              "      <td>0.177442</td>\n",
              "      <td>Attention allows the model to focus on differe...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2cda9e3d-8654-4c9b-92f5-daffe437f486')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-2cda9e3d-8654-4c9b-92f5-daffe437f486 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-2cda9e3d-8654-4c9b-92f5-daffe437f486');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"    display(hf_result[\\\"sources\\\"][[\\\"doc_id\\\", \\\"title\\\", \\\"similarity\\\", \\\"text\\\"]])\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"doc_6\",\n          \"doc_5\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"Retrieval-Augmented Generation\",\n          \"Transformers\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.10980192820720955,\n        \"min\": 0.17744158773172275,\n        \"max\": 0.39098369639856306,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.39098369639856306,\n          0.23983679070820418\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Transformers are neural network architectures based on the attention mechanism.\",\n          \"RAG can reduce hallucinations and allows models to use external knowledge.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "====================================================================================================\n",
            "Question: Why can language models hallucinate?\n",
            "\n",
            "Extractive answer:\n",
            "A hallucination occurs when a language model generates information that is not supported by the available evidence. Machine learning models should be evaluated on unseen data. RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "\n",
            "HF RAG answer:\n",
            "A hallucination occurs when a language model generates information that is not supported by the available evidence. Machine learning models should be evaluated on unseen data. RAG can reduce hallucinations and allows models to use external knowledge.\n",
            "\n",
            "Sources:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "  doc_id                           title  similarity  \\\n",
              "0  doc_8                   Hallucination    0.258069   \n",
              "1  doc_7                Model Evaluation    0.246489   \n",
              "2  doc_6  Retrieval-Augmented Generation    0.229025   \n",
              "\n",
              "                                                text  \n",
              "0  A hallucination occurs when a language model g...  \n",
              "1  Machine learning models should be evaluated on...  \n",
              "2  RAG can reduce hallucinations and allows model...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-02c5f83c-86c5-47e1-a467-3184d4282979\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>doc_id</th>\n",
              "      <th>title</th>\n",
              "      <th>similarity</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>doc_8</td>\n",
              "      <td>Hallucination</td>\n",
              "      <td>0.258069</td>\n",
              "      <td>A hallucination occurs when a language model g...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>doc_7</td>\n",
              "      <td>Model Evaluation</td>\n",
              "      <td>0.246489</td>\n",
              "      <td>Machine learning models should be evaluated on...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>doc_6</td>\n",
              "      <td>Retrieval-Augmented Generation</td>\n",
              "      <td>0.229025</td>\n",
              "      <td>RAG can reduce hallucinations and allows model...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-02c5f83c-86c5-47e1-a467-3184d4282979')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-02c5f83c-86c5-47e1-a467-3184d4282979 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-02c5f83c-86c5-47e1-a467-3184d4282979');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "dataframe",
              "summary": "{\n  \"name\": \"    display(hf_result[\\\"sources\\\"][[\\\"doc_id\\\", \\\"title\\\", \\\"similarity\\\", \\\"text\\\"]])\",\n  \"rows\": 3,\n  \"fields\": [\n    {\n      \"column\": \"doc_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"doc_8\",\n          \"doc_7\",\n          \"doc_6\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"Hallucination\",\n          \"Model Evaluation\",\n          \"Retrieval-Augmented Generation\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"similarity\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.014621008862250855,\n        \"min\": 0.22902512499888336,\n        \"max\": 0.258069228995511,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.258069228995511,\n          0.24648855723865923,\n          0.22902512499888336\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          \"A hallucination occurs when a language model generates information that is not supported by the available evidence.\",\n          \"Machine learning models should be evaluated on unseen data.\",\n          \"RAG can reduce hallucinations and allows models to use external knowledge.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
            }
          },
          "metadata": {}
        }
      ],
      "source": [
        "# Cell HF-5 — TODO: Compare generated and extractive RAG answers\n",
        "#\n",
        "# TODO:\n",
        "# 1. Choose three questions.\n",
        "# 2. Generate answers using `simple_rag_answer`.\n",
        "# 3. Generate answers using `rag_answer_with_hf`.\n",
        "# 4. Compare the results.\n",
        "#\n",
        "# Questions:\n",
        "# - Is the generated answer more readable?\n",
        "# - Does it add unsupported information?\n",
        "# - Are the cited/retrieved sources actually relevant?\n",
        "# - Does the answer change when top_k changes?\n",
        "\n",
        "comparison_questions = [\n",
        "    \"What is TF-IDF?\",\n",
        "    \"How do transformers use attention?\",\n",
        "    \"Why can language models hallucinate?\"\n",
        "]\n",
        "\n",
        "for q in comparison_questions:\n",
        "    print(\"=\" * 100)\n",
        "    print(\"Question:\", q)\n",
        "\n",
        "    extractive_answer, extractive_sources, _ = simple_rag_answer(q, top_k=3)\n",
        "    hf_result = rag_answer_with_hf(q, top_k=3, threshold=0.15)\n",
        "\n",
        "    print(\"\\nExtractive answer:\")\n",
        "    print(extractive_answer)\n",
        "\n",
        "    print(\"\\nHF RAG answer:\")\n",
        "    print(hf_result[\"answer\"])\n",
        "\n",
        "    print(\"\\nSources:\")\n",
        "    display(hf_result[\"sources\"][[\"doc_id\", \"title\", \"similarity\", \"text\"]])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "73f2ed94",
      "metadata": {
        "id": "73f2ed94"
      },
      "source": [
        "## Discussion: what changed after adding the Hugging Face model?\n",
        "\n",
        "The retriever and the generator solve different problems.\n",
        "\n",
        "The **retriever** decides which chunks are relevant.\n",
        "\n",
        "The **generator** turns retrieved chunks into a natural-language answer.\n",
        "\n",
        "This means that RAG may fail in at least two ways:\n",
        "\n",
        "1. **retrieval failure** — the system retrieves irrelevant or incomplete context,\n",
        "2. **generation failure** — the model ignores, distorts, or over-interprets the retrieved context.\n",
        "\n",
        "A small Hugging Face model may produce shorter and less fluent answers than a large LLM, but it is useful for teaching because students can run and inspect the whole pipeline."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f23ed3ad",
      "metadata": {
        "id": "f23ed3ad"
      },
      "source": [
        "# 9. What if retrieval fails?\n",
        "\n",
        "A RAG system can fail even if the language model is strong.\n",
        "\n",
        "Possible failure modes:\n",
        "\n",
        "1. The relevant document is not in the knowledge base.\n",
        "2. The relevant chunk exists but is not retrieved.\n",
        "3. The query uses different words than the document.\n",
        "4. The retrieved context is partially relevant but incomplete.\n",
        "5. The generator ignores the context.\n",
        "6. The generator adds unsupported claims.\n",
        "\n",
        "We will now test a question whose answer is not present in the knowledge base."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "79f3e9c9",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "79f3e9c9",
        "outputId": "2318a751-840b-472d-fef3-5737dce2d518"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Query:\n",
            "What is the capital city of Poland?\n",
            "\n",
            "Retrieved chunks:\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 1\n",
            "Source: doc_8 — Hallucination\n",
            "Similarity: 0.000\n",
            "Text: Grounding answers in retrieved sources can reduce but not completely eliminate hallucinations.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 2\n",
            "Source: doc_8 — Hallucination\n",
            "Similarity: 0.000\n",
            "Text: In RAG systems, hallucinations may still happen if the retrieved context is irrelevant, incomplete, or ignored by the generator.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 3\n",
            "Source: doc_8 — Hallucination\n",
            "Similarity: 0.000\n",
            "Text: A hallucination occurs when a language model generates information that is not supported by the available evidence.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 4\n",
            "Source: doc_7 — Model Evaluation\n",
            "Similarity: 0.000\n",
            "Text: Evaluation helps compare different model configurations.\n",
            "--------------------------------------------------------------------------------\n",
            "Rank 5\n",
            "Source: doc_7 — Model Evaluation\n",
            "Similarity: 0.000\n",
            "Text: For retrieval systems, we can measure whether relevant documents appear in the top retrieved results.\n",
            "--------------------------------------------------------------------------------\n"
          ]
        }
      ],
      "source": [
        "# Cell 12 — Out-of-scope question\n",
        "\n",
        "query = \"What is the capital city of Poland?\"\n",
        "\n",
        "retrieved = retrieve(query, vectorizer, chunk_vectors, chunks_df, top_k=5)\n",
        "\n",
        "print_retrieved_results(query, retrieved)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a402ad40",
      "metadata": {
        "id": "a402ad40"
      },
      "source": [
        "## Discussion\n",
        "\n",
        "The system always retrieves something, even if the question is unrelated.\n",
        "\n",
        "This is dangerous.\n",
        "\n",
        "A real RAG system should often use a similarity threshold:\n",
        "\n",
        "```text\n",
        "if best_similarity < threshold:\n",
        "    say \"I do not have enough information\"\n",
        "```\n",
        "\n",
        "This reduces the risk of unsupported answers."
      ]
    },
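    {
      "cell_type": "markdown",
      "id": "b7d41c20",
      "metadata": {
        "id": "b7d41c20"
      },
      "source": [
        "A minimal formalization of the threshold rule, in notation assumed here rather than taken from the lab: $q$ is the query vector, $c_1, \\dots, c_n$ are the chunk vectors, $\\tau$ is the threshold, and $s_{\\max}$ is the best similarity.\n",
        "\n",
        "$$\n",
        "s_{\\max} = \\max_{1 \\le i \\le n} \\cos(q, c_i), \\qquad \\text{answer}(q) = \\begin{cases} \\text{refuse} & \\text{if } s_{\\max} < \\tau \\\\ \\text{answer from the top-}k\\text{ chunks} & \\text{otherwise} \\end{cases}\n",
        "$$\n",
        "\n",
        "For the out-of-scope query about Poland, every reported similarity is $0.000$, so a threshold such as $\\tau = 0.15$ leads to a refusal."
      ]
    },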
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1a019a9d",
      "metadata": {
        "id": "1a019a9d"
      },
      "outputs": [],
      "source": [
        "# Cell 13 — Retrieval with similarity threshold\n",
        "\n",
        "\n",
        "def simple_rag_answer_with_threshold(query, top_k=3, threshold=0.15):\n",
        "    results = retrieve(\n",
        "        query=query,\n",
        "        vectorizer=vectorizer,\n",
        "        chunk_vectors=chunk_vectors,\n",
        "        chunks_df=chunks_df,\n",
        "        top_k=top_k\n",
        "    )\n",
        "\n",
        "    best_similarity = results[\"similarity\"].iloc[0]\n",
        "\n",
        "    if best_similarity < threshold:\n",
        "        answer = \"The provided knowledge base does not contain enough information to answer this question.\"\n",
        "        sources = results[[\"doc_id\", \"title\", \"chunk_id\", \"similarity\"]]\n",
        "        return answer, sources, results\n",
        "\n",
        "    answer = \" \".join(results[\"text\"].tolist())\n",
        "    sources = results[[\"doc_id\", \"title\", \"chunk_id\", \"similarity\"]]\n",
        "\n",
        "    return answer, sources, results\n",
        "\n",
        "\n",
        "queries = [\n",
        "    \"What is RAG?\",\n",
        "    \"What is the capital city of Poland?\"\n",
        "]\n",
        "\n",
        "for q in queries:\n",
        "    answer, sources, results = simple_rag_answer_with_threshold(q, top_k=3, threshold=0.15)\n",
        "\n",
        "    print(\"=\" * 100)\n",
        "    print(\"Question:\", q)\n",
        "    print(\"\\nAnswer:\")\n",
        "    print(answer)\n",
        "    print(\"\\nSources:\")\n",
        "    display(sources)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e87f672f",
      "metadata": {
        "id": "e87f672f"
      },
      "outputs": [],
      "source": [
        "# Cell 14 — TODO: Experiment with threshold\n",
        "#\n",
        "# TODO:\n",
        "# 1. Try different threshold values: 0.05, 0.10, 0.15, 0.25, 0.35.\n",
        "# 2. Test both in-domain and out-of-domain questions.\n",
        "# 3. Observe when the system refuses to answer.\n",
        "\n",
        "thresholds = [0.05, 0.10, 0.15, 0.25, 0.35]\n",
        "\n",
        "test_query = \"What is the capital city of Poland?\"\n",
        "\n",
        "for threshold in thresholds:\n",
        "    answer, sources, results = simple_rag_answer_with_threshold(\n",
        "        test_query,\n",
        "        top_k=3,\n",
        "        threshold=threshold\n",
        "    )\n",
        "\n",
        "    print(\"Threshold:\", threshold)\n",
        "    print(\"Best similarity:\", results[\"similarity\"].iloc[0])\n",
        "    print(\"Answer:\", answer)\n",
        "    print()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cd9649ad",
      "metadata": {
        "id": "cd9649ad"
      },
      "source": [
        "# 10. Retrieval evaluation\n",
        "\n",
        "To evaluate retrieval, we need questions with known relevant documents.\n",
        "\n",
        "For each question, we define which document should be retrieved.\n",
        "\n",
        "Example:\n",
        "\n",
        "```text\n",
        "Question: \"What is TF-IDF?\"\n",
        "Relevant document: doc_2\n",
        "```\n",
        "\n",
        "A simple metric is **Hit@k**:\n",
        "\n",
        "```text\n",
        "Hit@k = 1 if at least one relevant document appears in top-k results\n",
        "Hit@k = 0 otherwise\n",
        "```\n",
        "\n",
        "For a test set of questions, we compute the average Hit@k."
      ]
    },
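    {
      "cell_type": "markdown",
      "id": "e3a90f56",
      "metadata": {
        "id": "e3a90f56"
      },
      "source": [
        "As a sketch of the averaging step, with symbols introduced here for illustration only: $Q$ is the set of test questions, $D_q$ is the set of documents marked relevant for question $q$, and $R_k(q)$ is the set of document IDs among the top-$k$ retrieved chunks.\n",
        "\n",
        "$$\n",
        "\\mathrm{Hit@}k(q) = \\begin{cases} 1 & \\text{if } D_q \\cap R_k(q) \\neq \\emptyset \\\\ 0 & \\text{otherwise} \\end{cases}, \\qquad \\overline{\\mathrm{Hit@}k} = \\frac{1}{|Q|} \\sum_{q \\in Q} \\mathrm{Hit@}k(q)\n",
        "$$\n",
        "\n",
        "For example, if a relevant document is found for 6 of 8 questions, the average is $6/8 = 0.75$. These counts are illustrative, not results from this lab."
      ]
    },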
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d610231d",
      "metadata": {
        "id": "d610231d"
      },
      "outputs": [],
      "source": [
        "# Cell 15 — Define a small retrieval evaluation set\n",
        "\n",
        "evaluation_questions = [\n",
        "    {\"question\": \"What is Bag of Words?\", \"relevant_doc_id\": \"doc_1\"},\n",
        "    {\"question\": \"What does TF-IDF mean?\", \"relevant_doc_id\": \"doc_2\"},\n",
        "    {\"question\": \"How can we compare vectors?\", \"relevant_doc_id\": \"doc_3\"},\n",
        "    {\"question\": \"What are word embeddings?\", \"relevant_doc_id\": \"doc_4\"},\n",
        "    {\"question\": \"What is attention in transformers?\", \"relevant_doc_id\": \"doc_5\"},\n",
        "    {\"question\": \"How does RAG work?\", \"relevant_doc_id\": \"doc_6\"},\n",
        "    {\"question\": \"How do we evaluate classifiers?\", \"relevant_doc_id\": \"doc_7\"},\n",
        "    {\"question\": \"What is hallucination?\", \"relevant_doc_id\": \"doc_8\"}\n",
        "]\n",
        "\n",
        "evaluation_df = pd.DataFrame(evaluation_questions)\n",
        "evaluation_df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3cb162ea",
      "metadata": {
        "id": "3cb162ea"
      },
      "outputs": [],
      "source": [
        "# Cell 16 — Hit@k evaluation\n",
        "\n",
        "\n",
        "def hit_at_k(question, relevant_doc_id, k):\n",
        "    results = retrieve(\n",
        "        query=question,\n",
        "        vectorizer=vectorizer,\n",
        "        chunk_vectors=chunk_vectors,\n",
        "        chunks_df=chunks_df,\n",
        "        top_k=k\n",
        "    )\n",
        "\n",
        "    retrieved_doc_ids = set(results[\"doc_id\"].tolist())\n",
        "\n",
        "    return int(relevant_doc_id in retrieved_doc_ids)\n",
        "\n",
        "\n",
        "def evaluate_hit_at_k(evaluation_questions, k):\n",
        "    hits = []\n",
        "\n",
        "    for item in evaluation_questions:\n",
        "        hit = hit_at_k(\n",
        "            question=item[\"question\"],\n",
        "            relevant_doc_id=item[\"relevant_doc_id\"],\n",
        "            k=k\n",
        "        )\n",
        "        hits.append(hit)\n",
        "\n",
        "    return sum(hits) / len(hits)\n",
        "\n",
        "\n",
        "for k in [1, 2, 3, 5]:\n",
        "    score = evaluate_hit_at_k(evaluation_questions, k=k)\n",
        "    print(f\"Hit@{k}: {score:.3f}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a6a148e3",
      "metadata": {
        "id": "a6a148e3"
      },
      "outputs": [],
      "source": [
        "# Cell 17 — Detailed retrieval evaluation\n",
        "\n",
        "rows = []\n",
        "\n",
        "for item in evaluation_questions:\n",
        "    question = item[\"question\"]\n",
        "    relevant_doc_id = item[\"relevant_doc_id\"]\n",
        "\n",
        "    results = retrieve(\n",
        "        query=question,\n",
        "        vectorizer=vectorizer,\n",
        "        chunk_vectors=chunk_vectors,\n",
        "        chunks_df=chunks_df,\n",
        "        top_k=3\n",
        "    )\n",
        "\n",
        "    retrieved_doc_ids = results[\"doc_id\"].tolist()\n",
        "\n",
        "    rows.append({\n",
        "        \"question\": question,\n",
        "        \"relevant_doc_id\": relevant_doc_id,\n",
        "        \"top_1_doc\": retrieved_doc_ids[0],\n",
        "        \"top_3_docs\": retrieved_doc_ids,\n",
        "        \"hit_at_1\": int(relevant_doc_id == retrieved_doc_ids[0]),\n",
        "        \"hit_at_3\": int(relevant_doc_id in retrieved_doc_ids)\n",
        "    })\n",
        "\n",
        "retrieval_eval_df = pd.DataFrame(rows)\n",
        "retrieval_eval_df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2b66281c",
      "metadata": {
        "id": "2b66281c"
      },
      "outputs": [],
      "source": [
        "# Cell 18 — TODO: Add your own evaluation questions\n",
        "#\n",
        "# TODO:\n",
        "# 1. Add at least three new evaluation questions.\n",
        "# 2. Assign relevant document IDs.\n",
        "# 3. Recompute Hit@1 and Hit@3.\n",
        "\n",
        "my_evaluation_questions = evaluation_questions + [\n",
        "    {\"question\": \"Which method uses local frequency and global rarity?\", \"relevant_doc_id\": \"doc_2\"},\n",
        "    {\"question\": \"Which architecture uses attention?\", \"relevant_doc_id\": \"doc_5\"},\n",
        "    {\"question\": \"What can happen when a model invents unsupported facts?\", \"relevant_doc_id\": \"doc_8\"}\n",
        "]\n",
        "\n",
        "for k in [1, 3]:\n",
        "    score = evaluate_hit_at_k(my_evaluation_questions, k=k)\n",
        "    print(f\"Hit@{k}: {score:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2592c848",
      "metadata": {
        "id": "2592c848"
      },
      "source": [
        "# 11. Improving retrieval with n-grams\n",
        "\n",
        "TF-IDF with unigrams may miss important phrases.\n",
        "\n",
        "For example:\n",
        "\n",
        "```text\n",
        "word embeddings\n",
        "cosine similarity\n",
        "language model\n",
        "retrieval augmented generation\n",
        "```\n",
        "\n",
        "Using n-grams can help represent short phrases.\n",
        "\n",
        "We now compare retrieval using:\n",
        "\n",
        "1. unigrams only,\n",
        "2. unigrams and bigrams."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0d44bbcc",
      "metadata": {
        "id": "0d44bbcc"
      },
      "outputs": [],
      "source": [
        "# Cell 19 — TF-IDF retriever with n-grams\n",
        "\n",
        "ngram_vectorizer = TfidfVectorizer(\n",
        "    lowercase=True,\n",
        "    stop_words=\"english\",\n",
        "    ngram_range=(1, 2)\n",
        ")\n",
        "\n",
        "ngram_chunk_vectors = ngram_vectorizer.fit_transform(chunk_texts)\n",
        "\n",
        "print(\"Unigram features:\", chunk_vectors.shape[1])\n",
        "print(\"Unigram + bigram features:\", ngram_chunk_vectors.shape[1])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5ecf3dd6",
      "metadata": {
        "id": "5ecf3dd6"
      },
      "outputs": [],
      "source": [
        "# Cell 20 — Compare retrieval with and without bigrams\n",
        "\n",
        "test_queries = [\n",
        "    \"What are word embeddings?\",\n",
        "    \"What is cosine similarity?\",\n",
        "    \"What is retrieval augmented generation?\",\n",
        "    \"How do language models hallucinate?\"\n",
        "]\n",
        "\n",
        "for q in test_queries:\n",
        "    print(\"=\" * 100)\n",
        "    print(\"Query:\", q)\n",
        "\n",
        "    print(\"\\nUnigram retriever:\")\n",
        "    unigram_results = retrieve(q, vectorizer, chunk_vectors, chunks_df, top_k=3)\n",
        "    display(unigram_results[[\"doc_id\", \"title\", \"text\", \"similarity\"]])\n",
        "\n",
        "    print(\"\\nUnigram + bigram retriever:\")\n",
        "    ngram_results = retrieve(q, ngram_vectorizer, ngram_chunk_vectors, chunks_df, top_k=3)\n",
        "    display(ngram_results[[\"doc_id\", \"title\", \"text\", \"similarity\"]])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "047c53a6",
      "metadata": {
        "id": "047c53a6"
      },
      "outputs": [],
      "source": [
        "# Cell 21 — TODO: Evaluate n-gram retrieval\n",
        "#\n",
        "# TODO:\n",
        "# 1. Create a modified version of hit_at_k that uses ngram_vectorizer and ngram_chunk_vectors.\n",
        "# 2. Compute Hit@1 and Hit@3.\n",
        "# 3. Compare with the unigram retriever.\n",
        "\n",
        "\n",
        "def hit_at_k_ngram(question, relevant_doc_id, k):\n",
        "    results = retrieve(\n",
        "        query=question,\n",
        "        vectorizer=ngram_vectorizer,\n",
        "        chunk_vectors=ngram_chunk_vectors,\n",
        "        chunks_df=chunks_df,\n",
        "        top_k=k\n",
        "    )\n",
        "\n",
        "    retrieved_doc_ids = set(results[\"doc_id\"].tolist())\n",
        "\n",
        "    return int(relevant_doc_id in retrieved_doc_ids)\n",
        "\n",
        "\n",
        "def evaluate_hit_at_k_ngram(evaluation_questions, k):\n",
        "    hits = []\n",
        "\n",
        "    for item in evaluation_questions:\n",
        "        hit = hit_at_k_ngram(\n",
        "            question=item[\"question\"],\n",
        "            relevant_doc_id=item[\"relevant_doc_id\"],\n",
        "            k=k\n",
        "        )\n",
        "        hits.append(hit)\n",
        "\n",
        "    return sum(hits) / len(hits)\n",
        "\n",
        "\n",
        "print(\"Unigram retriever:\")\n",
        "for k in [1, 3]:\n",
        "    print(f\"Hit@{k}: {evaluate_hit_at_k(evaluation_questions, k=k):.3f}\")\n",
        "\n",
        "print(\"\\nUnigram + bigram retriever:\")\n",
        "for k in [1, 3]:\n",
        "    print(f\"Hit@{k}: {evaluate_hit_at_k_ngram(evaluation_questions, k=k):.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2eba8982",
      "metadata": {
        "id": "2eba8982"
      },
      "source": [
        "# 12. Optional: semantic embeddings\n",
        "\n",
        "So far, we used TF-IDF. This is simple and transparent, but it mostly relies on word overlap.\n",
        "\n",
        "Modern RAG systems usually use dense embeddings.\n",
        "\n",
        "Dense embedding retrievers can find semantically similar chunks even when the query and document do not use exactly the same words.\n",
        "\n",
        "For example:\n",
        "\n",
        "```text\n",
        "query: \"How can we compare text meaning?\"\n",
        "chunk: \"Cosine similarity measures the angle between vectors.\"\n",
        "```\n",
        "\n",
        "A TF-IDF retriever may struggle if there is little word overlap. An embedding retriever may perform better if trained well.\n",
        "\n",
        "The following optional section uses `sentence-transformers` if available in your environment.\n",
        "\n",
        "If it is not installed, you can skip this section."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "421af233",
      "metadata": {
        "id": "421af233"
      },
      "outputs": [],
      "source": [
        "# Cell 22 — Optional semantic embeddings\n",
        "#\n",
        "# This cell is optional.\n",
        "# It requires the sentence-transformers package.\n",
        "#\n",
        "# In Google Colab you may need:\n",
        "# !pip install sentence-transformers\n",
        "\n",
        "try:\n",
        "    from sentence_transformers import SentenceTransformer\n",
        "\n",
        "    embedding_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
        "\n",
        "    dense_chunk_vectors = embedding_model.encode(\n",
        "        chunk_texts,\n",
        "        normalize_embeddings=True\n",
        "    )\n",
        "\n",
        "    print(\"Dense chunk vectors shape:\", dense_chunk_vectors.shape)\n",
        "\n",
        "except Exception as e:\n",
        "    embedding_model = None\n",
        "    dense_chunk_vectors = None\n",
        "    print(\"sentence-transformers is not available.\")\n",
        "    print(\"You can skip the optional dense embedding section.\")\n",
        "    print(\"Error:\", e)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9555468e",
      "metadata": {
        "id": "9555468e"
      },
      "outputs": [],
      "source": [
        "# Cell 23 — Optional dense retrieval function\n",
        "\n",
        "\n",
        "def retrieve_dense(query, top_k=3):\n",
        "    if embedding_model is None or dense_chunk_vectors is None:\n",
        "        raise RuntimeError(\"Dense embedding model is not available.\")\n",
        "\n",
        "    query_vector = embedding_model.encode(\n",
        "        [query],\n",
        "        normalize_embeddings=True\n",
        "    )\n",
        "\n",
        "    similarities = cosine_similarity(query_vector, dense_chunk_vectors)[0]\n",
        "    top_indices = np.argsort(similarities)[::-1][:top_k]\n",
        "\n",
        "    results = chunks_df.iloc[top_indices].copy()\n",
        "    results[\"similarity\"] = similarities[top_indices]\n",
        "\n",
        "    return results.reset_index(drop=True)\n",
        "\n",
        "\n",
        "if embedding_model is not None:\n",
        "    query = \"How can we compare the meaning of texts?\"\n",
        "    dense_results = retrieve_dense(query, top_k=5)\n",
        "    print_retrieved_results(query, dense_results)\n",
        "else:\n",
        "    print(\"Dense retrieval skipped.\")"
      ]
    },
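    {
      "cell_type": "markdown",
      "id": "5c1d7e84",
      "metadata": {
        "id": "5c1d7e84"
      },
      "source": [
        "A short note on why normalization is convenient, stated as a general fact rather than something specific to this lab: for unit-length vectors, cosine similarity reduces to a dot product. Here $q$ and $c$ stand for a query embedding and a chunk embedding.\n",
        "\n",
        "$$\n",
        "\\cos(q, c) = \\frac{q \\cdot c}{\\lVert q \\rVert \\, \\lVert c \\rVert} = q \\cdot c \\quad \\text{when } \\lVert q \\rVert = \\lVert c \\rVert = 1\n",
        "$$\n",
        "\n",
        "With `normalize_embeddings=True`, the embeddings are returned with unit length, so `cosine_similarity` and a plain matrix product give the same ranking."
      ]
    },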
    {
      "cell_type": "markdown",
      "id": "018ea1ce",
      "metadata": {
        "id": "018ea1ce"
      },
      "source": [
        "# 13. RAG design questions\n",
        "\n",
        "A real RAG system requires many design decisions.\n",
        "\n",
        "Important questions include:\n",
        "\n",
        "1. What documents should be included in the knowledge base?\n",
        "2. How should documents be cleaned?\n",
        "3. How should documents be chunked?\n",
        "4. Which embedding model should be used?\n",
        "5. How many chunks should be retrieved?\n",
        "6. Should we use similarity thresholds?\n",
        "7. Should we use reranking?\n",
        "8. How should we evaluate retrieval quality?\n",
        "9. How should we evaluate answer quality?\n",
        "10. How should we cite sources?\n",
        "11. How should the system behave when it does not know the answer?"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1b8cfd1b",
      "metadata": {
        "id": "1b8cfd1b"
      },
      "source": [
        "# 14. Main task\n",
        "\n",
        "## Task: Build and evaluate a small RAG system\n",
        "\n",
        "Using the code from this lab, complete the following steps:\n",
        "\n",
        "1. Add at least three new documents to the knowledge base.\n",
        "2. Recreate sentence chunks.\n",
        "3. Refit the TF-IDF vectorizer.\n",
        "4. Write at least five user questions.\n",
        "5. Retrieve top-3 chunks for each question.\n",
        "6. Build a RAG prompt for each question.\n",
        "7. Generate or simulate answers using retrieved context.\n",
        "8. Add a similarity threshold.\n",
        "9. Define relevant document IDs for your questions.\n",
        "10. Compute Hit@1 and Hit@3.\n",
        "\n",
        "## Questions to answer\n",
        "\n",
        "1. Which questions were answered well?\n",
        "2. Which questions failed?\n",
        "3. Were failures caused by poor retrieval or missing information?\n",
        "4. How did changing `top_k` affect the answer?\n",
        "5. How did changing the similarity threshold affect the answer?\n",
        "6. What would you improve in this RAG pipeline?"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c5aa12f2",
      "metadata": {
        "id": "c5aa12f2"
      },
      "source": [
        "# 15. Summary\n",
        "\n",
        "In this lab, we built a simple RAG pipeline.\n",
        "\n",
        "The pipeline was:\n",
        "\n",
        "```text\n",
        "documents\n",
        "-> chunks\n",
        "-> vectorization\n",
        "-> query vectorization\n",
        "-> cosine similarity\n",
        "-> top-k retrieval\n",
        "-> context construction\n",
        "-> grounded answer\n",
        "```\n",
        "\n",
        "## Key ideas\n",
        "\n",
        "1. RAG combines retrieval and generation.\n",
        "2. Retrieval quality is critical for answer quality.\n",
        "3. Chunking is a key design choice.\n",
        "4. TF-IDF retrieval is simple and transparent.\n",
        "5. Dense embeddings can improve semantic retrieval.\n",
        "6. Similarity thresholds can reduce unsupported answers.\n",
        "7. Evaluation should include both retrieval metrics and answer quality.\n",
        "8. RAG can reduce hallucinations, but it cannot completely eliminate them.\n",
        "\n",
        "## Mathematical concepts used\n",
        "\n",
        "- vector representation,\n",
        "- sparse vectors,\n",
        "- dense vectors,\n",
        "- cosine similarity,\n",
        "- ranking,\n",
        "- top-k retrieval,\n",
        "- thresholding,\n",
        "- evaluation metrics such as Hit@k."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.11"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}