{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c9cd197e",
   "metadata": {},
   "source": [
    "# Prepare Training File: Load Model & Generate Training Pairs\n",
    "\n",
    "This notebook loads a language model and uses it to generate structured instruction/response training pairs from any input file. The generated pairs can be used directly for fine-tuning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "556d3fe5",
   "metadata": {},
   "source": [
    "## Setup: Environment Variables\n",
    "\n",
    "Configure CUDA and PyTorch environment variables to disable BF16 and FP16 precision reductions for stable training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a25b6a3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"CUDA_DISABLE_BF16\"] = \"1\"\n",
    "os.environ[\"TORCH_CUDA_ALLOW_BF16_REDUCED_PRECISION_REDUCTION\"] = \"0\"\n",
    "os.environ[\"ACCELERATE_DISABLE_FP16\"] = \"1\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97b9e212",
   "metadata": {},
   "source": [
    "## Setup: Import Required Libraries\n",
    "\n",
    "Import necessary libraries including transformers, torch, datasets, python-docx, json, os, and other utilities for document processing and model loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d63d552",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import logging\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "from docx import Document\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
    "import torch\n",
    "\n",
    "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
    "logger = logging.getLogger(__name__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84e04da2",
   "metadata": {},
   "source": [
    "## Setup: Configure Directory Structure\n",
    "\n",
    "Create and organize directory paths for storing training data, models, and intermediate outputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "993ed003",
   "metadata": {},
   "outputs": [],
   "source": [
    "OUTPUT_DIR = Path(\"./build/training_prep\")\n",
    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "DATA_DIR = OUTPUT_DIR / \"data\"\n",
    "DATA_DIR.mkdir(exist_ok=True)\n",
    "MODELS_DIR = OUTPUT_DIR / \"models\"\n",
    "MODELS_DIR.mkdir(exist_ok=True)\n",
    "\n",
    "MODEL_CACHE_DIR = Path(\"./model/base-model\")\n",
    "MODEL_CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
    "os.environ[\"HF_HOME\"] = str(MODEL_CACHE_DIR)\n",
    "\n",
    "logger.info(f\"Output directory: {OUTPUT_DIR}\")\n",
    "logger.info(f\"Model cache directory: {MODEL_CACHE_DIR}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0439c534",
   "metadata": {},
   "source": [
    "## Setup: Helper Functions\n",
    "\n",
    "Define utility functions for loading various file formats (DOCX, JSON, JSONL)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e34ff2b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_docx_file(file_path: str) -> list:\n",
    "    \"\"\"Load and parse a DOCX file into paragraphs.\"\"\"\n",
    "    logger.info(f\"Loading DOCX file: {file_path}\")\n",
    "    doc = Document(file_path)\n",
    "    paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]\n",
    "    logger.info(f\"Extracted {len(paragraphs)} paragraphs from {file_path}\")\n",
    "    return paragraphs\n",
    "\n",
    "\n",
    "def load_json_file(file_path: str) -> list:\n",
    "    \"\"\"Load a JSON file (array or object).\"\"\"\n",
    "    logger.info(f\"Loading JSON file: {file_path}\")\n",
    "    with open(file_path, 'r', encoding='utf-8') as f:\n",
    "        data = json.load(f)\n",
    "    if isinstance(data, list):\n",
    "        logger.info(f\"Loaded {len(data)} items from JSON file\")\n",
    "        return data\n",
    "    elif isinstance(data, dict):\n",
    "        logger.info(f\"JSON file is dict, converting to list\")\n",
    "        return [data]\n",
    "    return []\n",
    "\n",
    "\n",
    "def load_jsonl_file(file_path: str) -> list:\n",
    "    \"\"\"Load a JSONL file (one JSON object per line).\"\"\"\n",
    "    logger.info(f\"Loading JSONL file: {file_path}\")\n",
    "    items = []\n",
    "    with open(file_path, 'r', encoding='utf-8') as f:\n",
    "        for line in f:\n",
    "            if line.strip():\n",
    "                items.append(json.loads(line))\n",
    "    logger.info(f\"Loaded {len(items)} items from JSONL file\")\n",
    "    return items\n",
    "\n",
    "\n",
    "def load_training_file(file_path: str) -> list:\n",
    "    \"\"\"Load training file based on extension.\"\"\"\n",
    "    ext = Path(file_path).suffix.lower()\n",
    "    if ext == '.docx':\n",
    "        return load_docx_file(file_path)\n",
    "    elif ext == '.json':\n",
    "        return load_json_file(file_path)\n",
    "    elif ext == '.jsonl':\n",
    "        return load_jsonl_file(file_path)\n",
    "    else:\n",
    "        raise ValueError(f\"Unsupported file format: {ext}\")\n",
    "\n",
    "\n",
    "logger.info(\"Helper functions defined\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bea7ee7",
   "metadata": {},
   "source": [
    "## Step 1: Load and Configure the Base Model\n",
    "\n",
    "Load Meta-Llama-3-8B-Instruct with 4-bit quantization for efficient pair generation. The model will read your input file and generate formatted instruction/response pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0348d7d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not torch.cuda.is_available():\n",
    "    raise RuntimeError(\"CUDA not available. Please run in a GPU environment.\")\n",
    "\n",
    "logger.info(f\"Using GPU: {torch.cuda.get_device_name(0)}\")\n",
    "\n",
    "BASE_MODEL = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
    "\n",
    "logger.info(f\"Loading base model: {BASE_MODEL}\")\n",
    "tokenizer = AutoTokenizer.from_pretrained(\n",
    "    BASE_MODEL,\n",
    "    cache_dir=str(MODEL_CACHE_DIR),\n",
    "    local_files_only=False,\n",
    ")\n",
    "if tokenizer.pad_token is None:\n",
    "    tokenizer.pad_token = tokenizer.eos_token\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    BASE_MODEL,\n",
    "    cache_dir=str(MODEL_CACHE_DIR),\n",
    "    quantization_config=BitsAndBytesConfig(\n",
    "        load_in_4bit=True,\n",
    "        bnb_4bit_compute_dtype=torch.float16\n",
    "    ),\n",
    "    device_map=\"auto\",\n",
    "    dtype=torch.float16,\n",
    ")\n",
    "\n",
    "logger.info(\"Model loaded successfully\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbb7155b",
   "metadata": {},
   "source": [
    "## Step 2: Load Your Training File\n",
    "\n",
    "Specify the path to your training file (DOCX, JSON, or JSONL). The notebook will parse it and prepare it for pair generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe29c8b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "TRAINING_FILE = \"./model/data/data.docx\"\n",
    "training_data = load_training_file(TRAINING_FILE)\n",
    "logger.info(f\"Loaded {len(training_data)} items from training file\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70aa4949",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Loaded {len(training_data)} items\")\n",
    "print(f\"First item type: {type(training_data[0])}\")\n",
    "print(f\"First item (first 200 chars): {str(training_data[0])[:200]}\")\n",
    "if isinstance(training_data[0], dict):\n",
    "    print(f\"First item keys: {list(training_data[0].keys())}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdfdaa4d",
   "metadata": {},
   "source": [
    "## Step 3: Generate Training Pairs Using the Model\n",
    "\n",
    "The model will read your data and generate structured instruction/response pairs using a prompt-based approach. This ensures consistent formatting for fine-tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f36ab365",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_training_sample(sample) -> str:\n",
    "    \"\"\"Convert a training item into a concise text description.\"\"\"\n",
    "    try:\n",
    "        if isinstance(sample, dict):\n",
    "            parts = []\n",
    "            for k, v in sample.items():\n",
    "                if isinstance(v, str) and v.strip():\n",
    "                    parts.append(f\"{k}: {v.strip()}\")\n",
    "            return \" | \".join(parts) if parts else json.dumps(sample)\n",
    "        if isinstance(sample, str):\n",
    "            return sample.strip()\n",
    "        return str(sample)\n",
    "    except Exception:\n",
    "        return str(sample)\n",
    "\n",
    "\n",
    "def get_optimal_batch_size() -> int:\n",
    "    \"\"\"Calculate optimal batch size based on available GPU memory.\"\"\"\n",
    "    if not torch.cuda.is_available():\n",
    "        return 5\n",
    "\n",
    "    try:\n",
    "        gpu_mem = torch.cuda.get_device_properties(0).total_memory / (1024**3)\n",
    "\n",
    "        logger.info(f\"GPU total memory: {gpu_mem:.2f} GB\")\n",
    "\n",
    "        if gpu_mem >= 24:\n",
    "            return 20\n",
    "        elif gpu_mem >= 16:\n",
    "            return 15\n",
    "        elif gpu_mem >= 12:\n",
    "            return 12\n",
    "        elif gpu_mem >= 8:\n",
    "            return 8\n",
    "        else:\n",
    "            return 5\n",
    "    except Exception as e:\n",
    "        logger.warning(f\"Could not determine GPU memory: {e}. Using conservative batch size.\")\n",
    "        return 5\n",
    "\n",
    "\n",
    "def generate_pairs_with_model(training_data: list, batch_size: int = None, max_tokens: int = 2048) -> list:\n",
    "    \"\"\"\n",
    "    Use the model to generate instruction/response pairs from training data.\n",
    "    Processes data in batches to fit within GPU memory constraints.\n",
    "\n",
    "    Args:\n",
    "        training_data: List of training items to process\n",
    "        batch_size: Number of items per batch (None = auto-detect based on GPU memory)\n",
    "        max_tokens: Maximum tokens to generate per batch (default: 2048)\n",
    "    \"\"\"\n",
    "    if batch_size is None:\n",
    "        batch_size = get_optimal_batch_size()\n",
    "\n",
    "    logger.info(f\"Generating training pairs from {len(training_data)} items\")\n",
    "    logger.info(f\"Batch size: {batch_size}, Max tokens per batch: {max_tokens}\")\n",
    "\n",
    "    all_pairs = []\n",
    "\n",
    "    DEBUG_OUTPUT = False\n",
    "\n",
    "    for i in range(0, len(training_data), batch_size):\n",
    "        batch = training_data[i:i+batch_size]\n",
    "        batch_num = i//batch_size + 1\n",
    "        total_batches = (len(training_data) + batch_size - 1)//batch_size\n",
    "\n",
    "        logger.info(f\"Processing batch {batch_num}/{total_batches} ({len(batch)} items)\")\n",
    "\n",
    "        formatted = [f\"{j+1}. {format_training_sample(item)}\" for j, item in enumerate(batch)]\n",
    "        data_block = \"\\n\".join(formatted)\n",
    "\n",
    "        system_prompt = (\n",
    "            \"You are a JSON generator. Your task is to read content and output ONLY a valid JSON array.\\n\"\n",
    "            \"Each object must have exactly two fields: 'instruction' and 'response'.\\n\"\n",
    "            \"Do not include any text before or after the JSON array.\\n\"\n",
    "            \"The instruction field should be a question or task from the content.\\n\"\n",
    "            \"The response field should be the answer extracted from the content.\\n\"\n",
    "            \"Output MUST be valid JSON - nothing else.\"\n",
    "        )\n",
    "\n",
    "        user_prompt = (\n",
    "            f\"Content to extract training pairs from:\\n{data_block}\\n\\n\"\n",
    "            \"Output a JSON array with instruction-response pairs. Output ONLY the JSON array, no other text:\"\n",
    "        )\n",
    "\n",
    "        prompt = f\"<|im_start|>system\\n{system_prompt}<|im_end|>\\n<|im_start|>user\\n{user_prompt}<|im_end|>\\n<|im_start|>assistant\\n[\"\n",
    "\n",
    "        try:\n",
    "            inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
    "\n",
    "            if torch.cuda.is_available():\n",
    "                torch.cuda.empty_cache()\n",
    "\n",
    "            with torch.no_grad():\n",
    "                output = model.generate(\n",
    "                    **inputs,\n",
    "                    max_new_tokens=max_tokens,\n",
    "                    do_sample=True,\n",
    "                    temperature=0.7,\n",
    "                    top_p=0.95,\n",
    "                    top_k=50,\n",
    "                    eos_token_id=tokenizer.eos_token_id,\n",
    "                )\n",
    "\n",
    "            input_length = inputs.input_ids.shape[1]\n",
    "            generated_tokens = output[0][input_length:]\n",
    "            decoded = tokenizer.decode(generated_tokens, skip_special_tokens=True)\n",
    "\n",
    "            if DEBUG_OUTPUT:\n",
    "                print(f\"\\n[BATCH {batch_num} RAW OUTPUT]\")\n",
    "                print(decoded[:500])\n",
    "                print(\"\\n---\")\n",
    "                logger.debug(f\"Model output (first 300 chars): {decoded[:300]}\")\n",
    "\n",
    "            json_text = \"[\" + decoded\n",
    "\n",
    "            json_start = json_text.find(\"[\")\n",
    "            if json_start == -1:\n",
    "                logger.warning(f\"No JSON array found in batch {batch_num} output\")\n",
    "                if DEBUG_OUTPUT:\n",
    "                    print(f\"[BATCH {batch_num}] No '[' found in output\")\n",
    "                continue\n",
    "\n",
    "            bracket_count = 0\n",
    "            in_string = False\n",
    "            escape_next = False\n",
    "            json_end = -1\n",
    "\n",
    "            for idx in range(json_start, len(json_text)):\n",
    "                char = json_text[idx]\n",
    "\n",
    "                if escape_next:\n",
    "                    escape_next = False\n",
    "                    continue\n",
    "\n",
    "                if char == '\\\\':\n",
    "                    escape_next = True\n",
    "                    continue\n",
    "\n",
    "                if char == '\"' and not escape_next:\n",
    "                    in_string = not in_string\n",
    "                    continue\n",
    "\n",
    "                if not in_string:\n",
    "                    if char == '[':\n",
    "                        bracket_count += 1\n",
    "                    elif char == ']':\n",
    "                        bracket_count -= 1\n",
    "                        if bracket_count == 0:\n",
    "                            json_end = idx\n",
    "                            break\n",
    "\n",
    "            if json_end == -1:\n",
    "                logger.warning(f\"Failed to find JSON array boundary in batch {batch_num}\")\n",
    "                continue\n",
    "\n",
    "            try:\n",
    "                json_text = json_text[json_start: json_end + 1]\n",
    "                parsed = json.loads(json_text)\n",
    "\n",
    "                batch_pairs = 0\n",
    "                for item in parsed:\n",
    "                    instr = str(item.get(\"instruction\", \"\")).strip()\n",
    "                    resp = str(item.get(\"response\", \"\")).strip()\n",
    "                    if instr and resp:\n",
    "                        all_pairs.append((instr, resp))\n",
    "                        if DEBUG_OUTPUT:\n",
    "                            print(f\"Instruction: {instr}\\nResponse: {resp}\\n---\")\n",
    "                        batch_pairs += 1\n",
    "\n",
    "                logger.info(f\"Extracted {batch_pairs} pairs from batch {batch_num}\")\n",
    "            except json.JSONDecodeError as e:\n",
    "                logger.error(f\"Failed to parse JSON in batch {batch_num}: {str(e)}\")\n",
    "                if DEBUG_OUTPUT:\n",
    "                    logger.debug(f\"JSON text attempted (first 500 chars): {json_text[:500]}\")\n",
    "\n",
    "                try:\n",
    "                    json_text_fixed = json_text.replace(',]', ']').replace(',}', '}')\n",
    "                    parsed = json.loads(json_text_fixed)\n",
    "\n",
    "                    batch_pairs = 0\n",
    "                    for item in parsed:\n",
    "                        instr = str(item.get(\"instruction\", \"\")).strip()\n",
    "                        resp = str(item.get(\"response\", \"\")).strip()\n",
    "                        if instr and resp:\n",
    "                            all_pairs.append((instr, resp))\n",
    "                            if DEBUG_OUTPUT:\n",
    "                                print(f\"Instruction: {instr}\\nResponse: {resp}\\n---\")\n",
    "                            batch_pairs += 1\n",
    "\n",
    "                    logger.info(f\"Fixed JSON and extracted {batch_pairs} pairs from batch {batch_num}\")\n",
    "                except Exception as e2:\n",
    "                    logger.error(f\"Could not fix JSON in batch {batch_num}: {str(e2)}\")\n",
    "                    continue\n",
    "            except Exception as e:\n",
    "                logger.error(f\"Unexpected error parsing batch {batch_num}: {str(e)}\")\n",
    "                continue\n",
    "\n",
    "        except RuntimeError as e:\n",
    "            if \"out of memory\" in str(e).lower():\n",
    "                logger.error(f\"OOM in batch {batch_num}. Try reducing batch_size or max_tokens.\")\n",
    "                if torch.cuda.is_available():\n",
    "                    torch.cuda.empty_cache()\n",
    "                continue\n",
    "            raise\n",
    "\n",
    "    logger.info(f\"Total pairs generated: {len(all_pairs)}\")\n",
    "    return all_pairs\n",
    "\n",
    "\n",
    "training_pairs = generate_pairs_with_model(training_data, batch_size=None, max_tokens=2048)\n",
    "logger.info(f\"Generated {len(training_pairs)} training pairs\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85673dcd",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"\\n{'='*80}\")\n",
    "print(f\"Total training pairs generated: {len(training_pairs)}\")\n",
    "print(f\"{'='*80}\\n\")\n",
    "\n",
    "if training_pairs:\n",
    "    print(\"Sample training pairs:\")\n",
    "    for i, (instr, resp) in enumerate(training_pairs[:3], 1):\n",
    "        print(f\"\\nPair {i}:\")\n",
    "        print(f\"  Instruction: {instr[:100]}{'...' if len(instr) > 100 else ''}\")\n",
    "        print(f\"  Response: {resp[:100]}{'...' if len(resp) > 100 else ''}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dec03c6",
   "metadata": {},
   "source": [
    "## Step 4: Save Training Data to JSONL Format\n",
    "\n",
    "Export the generated pairs to a JSONL file for use with fine-tuning pipelines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0f727ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "output_file = DATA_DIR / \"generated_training_pairs.jsonl\"\n",
    "\n",
    "logger.info(f\"Saving {len(training_pairs)} pairs to {output_file}\")\n",
    "\n",
    "with open(output_file, 'w', encoding='utf-8') as f:\n",
    "    for instruction, response in training_pairs:\n",
    "        training_pair = {\n",
    "            \"instruction\": instruction,\n",
    "            \"output\": response,\n",
    "        }\n",
    "        f.write(json.dumps(training_pair, ensure_ascii=False) + \"\\n\")\n",
    "\n",
    "logger.info(f\"Training data saved to {output_file}\")\n",
    "print(f\"\\n✓ Training pairs saved to: {output_file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "761f92c1",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "Free GPU memory after pair generation is complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db644782",
   "metadata": {},
   "outputs": [],
   "source": [
    "del model\n",
    "del tokenizer\n",
    "import gc\n",
    "gc.collect()\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    torch.cuda.empty_cache()\n",
    "    torch.cuda.synchronize()\n",
    "\n",
    "logger.info(\"GPU memory freed\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}