Dynavera/notebooks/prepare-training-file.ipynb

584 lines
21 KiB
Text
Raw Normal View History

2026-01-28 09:55:43 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "c9cd197e",
"metadata": {},
"source": [
"# Prepare Training File: Load Model & Generate Training Pairs\n",
"\n",
"This notebook loads a language model and uses it to generate structured instruction/response training pairs from any input file. The generated pairs can be used directly for fine-tuning."
]
},
{
"cell_type": "markdown",
"id": "556d3fe5",
"metadata": {},
"source": [
"## Setup: Environment Variables\n",
"\n",
"Configure CUDA and PyTorch environment variables to disable BF16 and FP16 precision reductions for stable training."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a25b6a3b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"CUDA_DISABLE_BF16\"] = \"1\"\n",
"os.environ[\"TORCH_CUDA_ALLOW_BF16_REDUCED_PRECISION_REDUCTION\"] = \"0\"\n",
"os.environ[\"ACCELERATE_DISABLE_FP16\"] = \"1\""
]
},
{
"cell_type": "markdown",
"id": "97b9e212",
"metadata": {},
"source": [
"## Setup: Import Required Libraries\n",
"\n",
"Import necessary libraries including transformers, torch, datasets, python-docx, json, os, and other utilities for document processing and model loading."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d63d552",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import logging\n",
"import os\n",
"from pathlib import Path\n",
"\n",
"from docx import Document\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
"import torch\n",
"\n",
"logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
"logger = logging.getLogger(__name__)"
]
},
{
"cell_type": "markdown",
"id": "84e04da2",
"metadata": {},
"source": [
"## Setup: Configure Directory Structure\n",
"\n",
"Create and organize directory paths for storing training data, models, and intermediate outputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "993ed003",
"metadata": {},
"outputs": [],
"source": [
"OUTPUT_DIR = Path(\"./build/training_prep\")\n",
"OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
"DATA_DIR = OUTPUT_DIR / \"data\"\n",
"DATA_DIR.mkdir(exist_ok=True)\n",
"MODELS_DIR = OUTPUT_DIR / \"models\"\n",
"MODELS_DIR.mkdir(exist_ok=True)\n",
"\n",
"MODEL_CACHE_DIR = Path(\"./model/base-model\")\n",
"MODEL_CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
"os.environ[\"HF_HOME\"] = str(MODEL_CACHE_DIR)\n",
"\n",
"logger.info(f\"Output directory: {OUTPUT_DIR}\")\n",
"logger.info(f\"Model cache directory: {MODEL_CACHE_DIR}\")"
]
},
{
"cell_type": "markdown",
"id": "0439c534",
"metadata": {},
"source": [
"## Setup: Helper Functions\n",
"\n",
"Define utility functions for loading various file formats (DOCX, JSON, JSONL)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e34ff2b7",
"metadata": {},
"outputs": [],
"source": [
"def load_docx_file(file_path: str) -> list:\n",
" \"\"\"Load and parse a DOCX file into paragraphs.\"\"\"\n",
" logger.info(f\"Loading DOCX file: {file_path}\")\n",
" doc = Document(file_path)\n",
" paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]\n",
" logger.info(f\"Extracted {len(paragraphs)} paragraphs from {file_path}\")\n",
" return paragraphs\n",
"\n",
"\n",
"def load_json_file(file_path: str) -> list:\n",
" \"\"\"Load a JSON file (array or object).\"\"\"\n",
" logger.info(f\"Loading JSON file: {file_path}\")\n",
" with open(file_path, 'r', encoding='utf-8') as f:\n",
" data = json.load(f)\n",
" if isinstance(data, list):\n",
" logger.info(f\"Loaded {len(data)} items from JSON file\")\n",
" return data\n",
" elif isinstance(data, dict):\n",
" logger.info(f\"JSON file is dict, converting to list\")\n",
" return [data]\n",
" return []\n",
"\n",
"\n",
"def load_jsonl_file(file_path: str) -> list:\n",
" \"\"\"Load a JSONL file (one JSON object per line).\"\"\"\n",
" logger.info(f\"Loading JSONL file: {file_path}\")\n",
" items = []\n",
" with open(file_path, 'r', encoding='utf-8') as f:\n",
" for line in f:\n",
" if line.strip():\n",
" items.append(json.loads(line))\n",
" logger.info(f\"Loaded {len(items)} items from JSONL file\")\n",
" return items\n",
"\n",
"\n",
"def load_training_file(file_path: str) -> list:\n",
" \"\"\"Load training file based on extension.\"\"\"\n",
" ext = Path(file_path).suffix.lower()\n",
" if ext == '.docx':\n",
" return load_docx_file(file_path)\n",
" elif ext == '.json':\n",
" return load_json_file(file_path)\n",
" elif ext == '.jsonl':\n",
" return load_jsonl_file(file_path)\n",
" else:\n",
" raise ValueError(f\"Unsupported file format: {ext}\")\n",
"\n",
"\n",
"logger.info(\"Helper functions defined\")"
]
},
{
"cell_type": "markdown",
"id": "3bea7ee7",
"metadata": {},
"source": [
"## Step 1: Load and Configure the Base Model\n",
"\n",
"Load Meta-Llama-3-8B-Instruct with 4-bit quantization for efficient pair generation. The model will read your input file and generate formatted instruction/response pairs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0348d7d6",
"metadata": {},
"outputs": [],
"source": [
"if not torch.cuda.is_available():\n",
" raise RuntimeError(\"CUDA not available. Please run in a GPU environment.\")\n",
"\n",
"logger.info(f\"Using GPU: {torch.cuda.get_device_name(0)}\")\n",
"\n",
"BASE_MODEL = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
"\n",
"logger.info(f\"Loading base model: {BASE_MODEL}\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\n",
" BASE_MODEL,\n",
" cache_dir=str(MODEL_CACHE_DIR),\n",
" local_files_only=False,\n",
")\n",
"if tokenizer.pad_token is None:\n",
" tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
" BASE_MODEL,\n",
" cache_dir=str(MODEL_CACHE_DIR),\n",
" quantization_config=BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_compute_dtype=torch.float16\n",
" ),\n",
" device_map=\"auto\",\n",
" dtype=torch.float16,\n",
")\n",
"\n",
"logger.info(\"Model loaded successfully\")"
]
},
{
"cell_type": "markdown",
"id": "bbb7155b",
"metadata": {},
"source": [
"## Step 2: Load Your Training File\n",
"\n",
"Specify the path to your training file (DOCX, JSON, or JSONL). The notebook will parse it and prepare it for pair generation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe29c8b2",
"metadata": {},
"outputs": [],
"source": [
"TRAINING_FILE = \"./model/data/data.docx\"\n",
"training_data = load_training_file(TRAINING_FILE)\n",
"logger.info(f\"Loaded {len(training_data)} items from training file\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70aa4949",
"metadata": {},
"outputs": [],
"source": [
"print(f\"Loaded {len(training_data)} items\")\n",
"print(f\"First item type: {type(training_data[0])}\")\n",
"print(f\"First item (first 200 chars): {str(training_data[0])[:200]}\")\n",
"if isinstance(training_data[0], dict):\n",
" print(f\"First item keys: {list(training_data[0].keys())}\")"
]
},
{
"cell_type": "markdown",
"id": "cdfdaa4d",
"metadata": {},
"source": [
"## Step 3: Generate Training Pairs Using the Model\n",
"\n",
"The model will read your data and generate structured instruction/response pairs using a prompt-based approach. This ensures consistent formatting for fine-tuning."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f36ab365",
"metadata": {},
"outputs": [],
"source": [
"def format_training_sample(sample) -> str:\n",
" \"\"\"Convert a training item into a concise text description.\"\"\"\n",
" try:\n",
" if isinstance(sample, dict):\n",
" parts = []\n",
" for k, v in sample.items():\n",
" if isinstance(v, str) and v.strip():\n",
" parts.append(f\"{k}: {v.strip()}\")\n",
" return \" | \".join(parts) if parts else json.dumps(sample)\n",
" if isinstance(sample, str):\n",
" return sample.strip()\n",
" return str(sample)\n",
" except Exception:\n",
" return str(sample)\n",
"\n",
"\n",
"def get_optimal_batch_size() -> int:\n",
" \"\"\"Calculate optimal batch size based on available GPU memory.\"\"\"\n",
" if not torch.cuda.is_available():\n",
" return 5\n",
"\n",
" try:\n",
" gpu_mem = torch.cuda.get_device_properties(0).total_memory / (1024**3)\n",
"\n",
" logger.info(f\"GPU total memory: {gpu_mem:.2f} GB\")\n",
"\n",
" if gpu_mem >= 24:\n",
" return 20\n",
" elif gpu_mem >= 16:\n",
" return 15\n",
" elif gpu_mem >= 12:\n",
" return 12\n",
" elif gpu_mem >= 8:\n",
" return 8\n",
" else:\n",
" return 5\n",
" except Exception as e:\n",
" logger.warning(f\"Could not determine GPU memory: {e}. Using conservative batch size.\")\n",
" return 5\n",
"\n",
"\n",
"def generate_pairs_with_model(training_data: list, batch_size: int = None, max_tokens: int = 2048) -> list:\n",
" \"\"\"\n",
" Use the model to generate instruction/response pairs from training data.\n",
" Processes data in batches to fit within GPU memory constraints.\n",
"\n",
" Args:\n",
" training_data: List of training items to process\n",
" batch_size: Number of items per batch (None = auto-detect based on GPU memory)\n",
" max_tokens: Maximum tokens to generate per batch (default: 2048)\n",
" \"\"\"\n",
" if batch_size is None:\n",
" batch_size = get_optimal_batch_size()\n",
"\n",
" logger.info(f\"Generating training pairs from {len(training_data)} items\")\n",
" logger.info(f\"Batch size: {batch_size}, Max tokens per batch: {max_tokens}\")\n",
"\n",
" all_pairs = []\n",
"\n",
" DEBUG_OUTPUT = False\n",
"\n",
" for i in range(0, len(training_data), batch_size):\n",
" batch = training_data[i:i+batch_size]\n",
" batch_num = i//batch_size + 1\n",
" total_batches = (len(training_data) + batch_size - 1)//batch_size\n",
"\n",
" logger.info(f\"Processing batch {batch_num}/{total_batches} ({len(batch)} items)\")\n",
"\n",
" formatted = [f\"{j+1}. {format_training_sample(item)}\" for j, item in enumerate(batch)]\n",
" data_block = \"\\n\".join(formatted)\n",
"\n",
" system_prompt = (\n",
" \"You are a JSON generator. Your task is to read content and output ONLY a valid JSON array.\\n\"\n",
" \"Each object must have exactly two fields: 'instruction' and 'response'.\\n\"\n",
" \"Do not include any text before or after the JSON array.\\n\"\n",
" \"The instruction field should be a question or task from the content.\\n\"\n",
" \"The response field should be the answer extracted from the content.\\n\"\n",
" \"Output MUST be valid JSON - nothing else.\"\n",
" )\n",
"\n",
" user_prompt = (\n",
" f\"Content to extract training pairs from:\\n{data_block}\\n\\n\"\n",
" \"Output a JSON array with instruction-response pairs. Output ONLY the JSON array, no other text:\"\n",
" )\n",
"\n",
" prompt = f\"<|im_start|>system\\n{system_prompt}<|im_end|>\\n<|im_start|>user\\n{user_prompt}<|im_end|>\\n<|im_start|>assistant\\n[\"\n",
"\n",
" try:\n",
" inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
"\n",
" if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
"\n",
" with torch.no_grad():\n",
" output = model.generate(\n",
" **inputs,\n",
" max_new_tokens=max_tokens,\n",
" do_sample=True,\n",
" temperature=0.7,\n",
" top_p=0.95,\n",
" top_k=50,\n",
" eos_token_id=tokenizer.eos_token_id,\n",
" )\n",
"\n",
" input_length = inputs.input_ids.shape[1]\n",
" generated_tokens = output[0][input_length:]\n",
" decoded = tokenizer.decode(generated_tokens, skip_special_tokens=True)\n",
"\n",
" if DEBUG_OUTPUT:\n",
" print(f\"\\n[BATCH {batch_num} RAW OUTPUT]\")\n",
" print(decoded[:500])\n",
" print(\"\\n---\")\n",
" logger.debug(f\"Model output (first 300 chars): {decoded[:300]}\")\n",
"\n",
" json_text = \"[\" + decoded\n",
"\n",
" json_start = json_text.find(\"[\")\n",
" if json_start == -1:\n",
" logger.warning(f\"No JSON array found in batch {batch_num} output\")\n",
" if DEBUG_OUTPUT:\n",
" print(f\"[BATCH {batch_num}] No '[' found in output\")\n",
" continue\n",
"\n",
" bracket_count = 0\n",
" in_string = False\n",
" escape_next = False\n",
" json_end = -1\n",
"\n",
" for idx in range(json_start, len(json_text)):\n",
" char = json_text[idx]\n",
"\n",
" if escape_next:\n",
" escape_next = False\n",
" continue\n",
"\n",
" if char == '\\\\':\n",
" escape_next = True\n",
" continue\n",
"\n",
" if char == '\"' and not escape_next:\n",
" in_string = not in_string\n",
" continue\n",
"\n",
" if not in_string:\n",
" if char == '[':\n",
" bracket_count += 1\n",
" elif char == ']':\n",
" bracket_count -= 1\n",
" if bracket_count == 0:\n",
" json_end = idx\n",
" break\n",
"\n",
" if json_end == -1:\n",
" logger.warning(f\"Failed to find JSON array boundary in batch {batch_num}\")\n",
" continue\n",
"\n",
" try:\n",
" json_text = json_text[json_start: json_end + 1]\n",
" parsed = json.loads(json_text)\n",
"\n",
" batch_pairs = 0\n",
" for item in parsed:\n",
" instr = str(item.get(\"instruction\", \"\")).strip()\n",
" resp = str(item.get(\"response\", \"\")).strip()\n",
" if instr and resp:\n",
" all_pairs.append((instr, resp))\n",
" if DEBUG_OUTPUT:\n",
" print(f\"Instruction: {instr}\\nResponse: {resp}\\n---\")\n",
" batch_pairs += 1\n",
"\n",
" logger.info(f\"Extracted {batch_pairs} pairs from batch {batch_num}\")\n",
" except json.JSONDecodeError as e:\n",
" logger.error(f\"Failed to parse JSON in batch {batch_num}: {str(e)}\")\n",
" if DEBUG_OUTPUT:\n",
" logger.debug(f\"JSON text attempted (first 500 chars): {json_text[:500]}\")\n",
"\n",
" try:\n",
" json_text_fixed = json_text.replace(',]', ']').replace(',}', '}')\n",
" parsed = json.loads(json_text_fixed)\n",
"\n",
" batch_pairs = 0\n",
" for item in parsed:\n",
" instr = str(item.get(\"instruction\", \"\")).strip()\n",
" resp = str(item.get(\"response\", \"\")).strip()\n",
" if instr and resp:\n",
" all_pairs.append((instr, resp))\n",
" if DEBUG_OUTPUT:\n",
" print(f\"Instruction: {instr}\\nResponse: {resp}\\n---\")\n",
" batch_pairs += 1\n",
"\n",
" logger.info(f\"Fixed JSON and extracted {batch_pairs} pairs from batch {batch_num}\")\n",
" except Exception as e2:\n",
" logger.error(f\"Could not fix JSON in batch {batch_num}: {str(e2)}\")\n",
" continue\n",
" except Exception as e:\n",
" logger.error(f\"Unexpected error parsing batch {batch_num}: {str(e)}\")\n",
" continue\n",
"\n",
" except RuntimeError as e:\n",
" if \"out of memory\" in str(e).lower():\n",
" logger.error(f\"OOM in batch {batch_num}. Try reducing batch_size or max_tokens.\")\n",
" if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
" continue\n",
" raise\n",
"\n",
" logger.info(f\"Total pairs generated: {len(all_pairs)}\")\n",
" return all_pairs\n",
"\n",
"\n",
"training_pairs = generate_pairs_with_model(training_data, batch_size=None, max_tokens=2048)\n",
"logger.info(f\"Generated {len(training_pairs)} training pairs\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85673dcd",
"metadata": {},
"outputs": [],
"source": [
"print(f\"\\n{'='*80}\")\n",
"print(f\"Total training pairs generated: {len(training_pairs)}\")\n",
"print(f\"{'='*80}\\n\")\n",
"\n",
"if training_pairs:\n",
" print(\"Sample training pairs:\")\n",
" for i, (instr, resp) in enumerate(training_pairs[:3], 1):\n",
" print(f\"\\nPair {i}:\")\n",
" print(f\" Instruction: {instr[:100]}{'...' if len(instr) > 100 else ''}\")\n",
" print(f\" Response: {resp[:100]}{'...' if len(resp) > 100 else ''}\")"
]
},
{
"cell_type": "markdown",
"id": "4dec03c6",
"metadata": {},
"source": [
"## Step 4: Save Training Data to JSONL Format\n",
"\n",
"Export the generated pairs to a JSONL file for use with fine-tuning pipelines."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0f727ee",
"metadata": {},
"outputs": [],
"source": [
"output_file = DATA_DIR / \"generated_training_pairs.jsonl\"\n",
"\n",
"logger.info(f\"Saving {len(training_pairs)} pairs to {output_file}\")\n",
"\n",
"with open(output_file, 'w', encoding='utf-8') as f:\n",
" for instruction, response in training_pairs:\n",
" training_pair = {\n",
" \"instruction\": instruction,\n",
" \"output\": response,\n",
" }\n",
" f.write(json.dumps(training_pair, ensure_ascii=False) + \"\\n\")\n",
"\n",
"logger.info(f\"Training data saved to {output_file}\")\n",
"print(f\"\\n✓ Training pairs saved to: {output_file}\")"
]
},
{
"cell_type": "markdown",
"id": "761f92c1",
"metadata": {},
"source": [
"## Cleanup\n",
"\n",
"Free GPU memory after pair generation is complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db644782",
"metadata": {},
"outputs": [],
"source": [
"del model\n",
"del tokenizer\n",
"import gc\n",
"gc.collect()\n",
"\n",
"if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
" torch.cuda.synchronize()\n",
"\n",
"logger.info(\"GPU memory freed\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}