logging.warning("No sample page text available for Gemini context.")
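# A minimal sketch, not part of the original pipeline: the page-range rules
# stated in the prompts below could also be applied locally as a sanity check
# on whatever structure the model returns. The helper names and the plain-dict
# layout (`start_page`, `end_page`, `children`) are assumptions made for
# illustration; the real Pydantic schema may be shaped differently.
def fill_sibling_end_pages(sections, last_page):
    """Set each missing or invalid end_page to the page before the next sibling starts."""
    for i, section in enumerate(sections):
        is_last = i + 1 == len(sections)
        # The last sibling runs to the end of the book (hypothetical `last_page`).
        limit = last_page if is_last else sections[i + 1]["start_page"] - 1
        end = section.get("end_page")
        if end is None or end < section["start_page"]:
            # Never let end_page fall before start_page, even when two
            # siblings start on the same page.
            section["end_page"] = max(limit, section["start_page"])
    return sections


def span_from_children(parent):
    """Span a parent from its first child's start_page to its last child's end_page."""
    children = parent.get("children") or []
    if children:
        parent["start_page"] = children[0]["start_page"]
        parent["end_page"] = children[-1]["end_page"]
    return parent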
# --- Step 1: Initial Structure Extraction ---
# We ask Gemini to generate an initial structure.
initial_prompt = f"""
You are an extremely meticulous expert at analyzing {lang} school textbooks to extract their hierarchical structure.
Your task is to identify Units, Concepts, Lessons, and other sections, along with their exact start and end page numbers.
**Crucially, when no explicit page range is given for a UNIT or CONCEPT, derive it from its children: it starts at the start of its first lesson/concept and ends at the end of its last lesson/concept.**
Pay extremely close attention to titles, headings, and page number references in the provided text.
Infer page ranges for all elements as accurately as possible.
**Definitions and Page Range Rules:**
- **Unit:** The highest level (e.g., "Unit 1", "الوحدة الأولى"). `start_page` is where the unit title first appears. `end_page` is the page *before* the next unit starts, or the last page of the book if it's the final unit ({pdf_total_pages}).
- **Concept:** A sub-division within a Unit (e.g., "Concept 1", "المفهوم الأول"). If it has lessons, its `start_page` is the `start_page` of its first lesson, and its `end_page` is the `end_page` of its last lesson. If it has no lessons, its `start_page` is where its title appears, and its `end_page` is the page *before* the next section.
- **Lesson:** A specific topic within a Concept (e.g., "Lesson 1.1", "الدرس 1.1"). `start_page` is where its title appears. `end_page` is the page *before* the next lesson or section.
- **Other Sections:** Introductory/concluding elements within a Unit (e.g., "Get Started", "مقدمة", "Unit Project", "مشروع الوحدة"). `start_page` is where its title appears. `end_page` is the page *before* the next section.
Output must be ONLY a valid JSON object (no markdown, no extra text) conforming to this exact Pydantic structure definition:
# Now, we give Gemini the initial structure and ask it to refine it
# using the TOC and page tracking as explicit correction tools.
refinement_prompt = f"""
You are an expert at validating and correcting hierarchical book structures.
I have an initial structure extracted from a {lang} textbook (Grade: {grade}), but it needs careful review and correction, especially for page ranges and the completeness of sections.
**Instructions for Refinement:**
1. **Strictly adhere to the Pydantic schema.** Ensure all fields are present and types are correct.
2. **Verify all page ranges.** Ensure `start_page` is always less than or equal to `end_page`. If `start_page` is 0, attempt to find a valid page. If `end_page` is missing or invalid, infer it as the page *before* the next section, or as the last page of the book ({pdf_total_pages}) if it's the last section.
3. **Cross-reference with TOC and Tracked Titles:**
* If the `start_page` for a Unit, Concept, or Lesson in the initial structure conflicts with a `start_page` explicitly mentioned in the provided TOC or `Pre-identified Section Titles`, **prioritize the TOC/Pre-identified value**.
* Use the TOC content to identify any *missing* Units, Concepts, or Lessons that should be included. If found, add them with inferred page ranges.
4. **Sequential Page Range Logic:**
* The `end_page` of a section should ideally be one less than the `start_page` of the *next* logical section (at the same or a higher hierarchical level).
* For parent elements (Units, Concepts), if their `pages` are not explicitly defined, ensure they span from the `start_page` of their first child to the `end_page` of their last child.
5. **Completeness:** Ensure all significant Units, Concepts, Lessons, and 'other_sections' that logically exist in the book are present in the final structure. Use the TOC and tracked titles as primary guides for completeness.
6. **Do NOT include any markdown or extra text outside the JSON.**