Goodbye to Scattered Screenshots! Package an Entire Website into MHTML Files with Python + Selenium (Full Code Included)

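Whenever you come across a great tutorial site or documentation site, have you run into this problem: you want an offline copy, but you end up taking screenshots one at a time, or you use the browser's built-in "Save as" and find broken styles and missing images? Today I'll share an efficient solution: using Python + Selenium to save every page of a website as an MHTML file. This approach preserves the original layout, images, and styles, and it can be automated for batch processing, which makes it a good fit for offline study, content backup, and site analysis.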

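MHTML (MIME HTML) is a format that packs a web page's resources (HTML, CSS, JavaScript, images, and so on) into a single file. Compared with screenshots or a plain HTML save, it has three main advantages: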

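Completeness: preserves the page's layout, styles, and images as rendered (script-driven interactivity may not survive in the saved copy)
Portability: a single file is easy to store and share
Readability: the page displays in full without a network connection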
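
To make "a single file holding all resources" concrete: an MHTML file is a MIME multipart message, so Python's standard `email` module can list what is inside it. The sketch below parses a tiny hand-written sample (not real Chrome output); `SAMPLE_MHTML` and `list_mhtml_parts` are names I made up for illustration.

```python
from email import message_from_string

# A tiny hand-written MHTML document, used here only for illustration.
SAMPLE_MHTML = (
    "From: <Saved by example>\r\n"
    "Subject: Demo page\r\n"
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/related; type="text/html"; boundary="----demo-boundary"\r\n'
    "\r\n"
    "------demo-boundary\r\n"
    "Content-Type: text/html\r\n"
    "Content-Location: https://example.com/\r\n"
    "\r\n"
    '<html><body><img src="https://example.com/logo.png"></body></html>\r\n'
    "------demo-boundary\r\n"
    "Content-Type: image/png\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "Content-Location: https://example.com/logo.png\r\n"
    "\r\n"
    "iVBORw0KGgo=\r\n"
    "------demo-boundary--\r\n"
)


def list_mhtml_parts(mhtml_text):
    """Return (content_type, content_location) for every part in an MHTML document."""
    message = message_from_string(mhtml_text)
    return [
        (part.get_content_type(), part.get("Content-Location"))
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart container itself
    ]


parts = list_mhtml_parts(SAMPLE_MHTML)
for content_type, location in parts:
    print(content_type, location)
```

Running the same function on a file saved by the script in this article is a quick way to check whether the images and stylesheets you expected were actually captured.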

1. Environment Setup and Toolchain

1.1 Installing the Required Tools

Before you start, make sure the following packages are installed:

pip install selenium beautifulsoup4 tqdm requests

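You also need a ChromeDriver that matches your local Chrome version. Selenium 4.6 and later download a matching driver automatically through Selenium Manager; on older versions, download ChromeDriver manually and add its location to the system PATH.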

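1.2 Chrome DevTools Protocol Configuration

We will use the Chrome DevTools Protocol (CDP) to generate the MHTML files. Create a new Python file and add the following basic configuration: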

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")  # Headless mode (no visible window)
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)

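2. Core Implementation and Walkthrough

2.1 Link Collection Module

First we need the links to all pages of the target site. Here we crawl the site iteratively, following in-site links up to a limited depth: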

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag

def get_all_links(base_url, max_depth=3):
    """Breadth-first crawl: collect in-site links up to max_depth levels deep."""
    visited = set()
    all_links = {base_url}
    current_level = {base_url}
    for _ in range(max_depth):
        next_level = set()
        for current_url in current_level:
            if current_url in visited:
                continue
            visited.add(current_url)
            try:
                response = requests.get(current_url, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, 'html.parser')
                for link in soup.find_all('a', href=True):
                    # Resolve relative links against the current page and drop #fragments
                    absolute_url, _fragment = urldefrag(urljoin(current_url, link['href']))
                    if absolute_url.startswith(base_url) and absolute_url not in all_links:
                        all_links.add(absolute_url)
                        next_level.add(absolute_url)
            except Exception as e:
                print(f"Error processing {current_url}: {e}")
        current_level = next_level  # Descend one level
    return all_links
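
The crawler above decides whether a link belongs to the site with a plain `startswith(base_url)` string check. That is simple, but it also accepts look-alike hosts such as `https://example.com.evil.net/`. A stricter variant, sketched here with only the standard library, compares the parsed host and path instead. `normalize_link` and `is_same_site` are hypothetical helper names, not part of the code above.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_link(page_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    return absolute


def is_same_site(base_url, candidate_url):
    """True if candidate_url is http(s), on the same host as base_url, and under its path."""
    base, candidate = urlparse(base_url), urlparse(candidate_url)
    if candidate.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, tel: and similar links
    if candidate.netloc.lower() != base.netloc.lower():
        return False
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    candidate_path = candidate.path or "/"
    return candidate_path == base.path or candidate_path.startswith(base_path)


print(normalize_link("https://example.com/docs/intro.html", "../docs/api.html#usage"))
print(is_same_site("https://example.com/docs", "https://example.com/docs/api.html"))
print(is_same_site("https://example.com", "https://example.com.evil.net/"))
```

If you want this behavior, the `startswith` condition in `get_all_links` could be swapped for a call to `is_same_site(base_url, absolute_url)`.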

2.2 Generating and Saving MHTML

The key step is capturing a page snapshot through the Chrome DevTools Protocol:

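import time

def save_as_mhtml(driver, url, output_path):
    driver.get(url)
    time.sleep(3)  # Crude fixed wait for the page to finish loading
    try:
        # Capture the page as MHTML with a CDP command
        result = driver.execute_cdp_cmd('Page.captureSnapshot', {'format': 'mhtml'})
        # newline='' writes the snapshot's own line endings unchanged
        with open(output_path, 'w', encoding='utf-8', newline='') as f:
            f.write(result['data'])
        return True
    except Exception as e:
        print(f"Failed to save {url}: {e}")
        return False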

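3. Advanced Features

3.1 Handling Dynamically Loaded Content

Modern sites often load content dynamically with AJAX, so we need to make sure that content is present before capturing the page: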

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def wait_for_dynamic_content(driver, timeout=10):
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script(
                "return document.readyState === 'complete'"
            )
        )
        # Wait for a common dynamic element (this times out on pages without any <img>)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, 'img'))
        )
    except Exception as e:
        print(f"Dynamic content loading timeout: {str(e)}")

3.2 Dealing with Anti-Scraping Measures

For sites with anti-scraping mechanisms, you can add the following options:

chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

4. Complete Workflow and Optimization Tips

4.1 Putting the Main Program Together

Combine the modules into a complete workflow:

import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from tqdm import tqdm

def main():
    base_url = "https://example.com"  # Replace with the target site
    output_dir = "mhtml_output"
    os.makedirs(output_dir, exist_ok=True)

    # Initialize the browser driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Collect all links
        print("Collecting all page links...")
        all_links = get_all_links(base_url)
        print(f"Found {len(all_links)} pages to save")

        # Save each page as MHTML
        saved = 0
        for link in tqdm(all_links, desc="Saving pages"):
            # Strip the site prefix and replace characters that are unsafe in filenames
            filename = link.replace(base_url, "").strip("/")
            for ch in '/\\?%*:|"<>&=#':
                filename = filename.replace(ch, "_")
            output_path = os.path.join(output_dir, f"{filename or 'index'}.mhtml")
            if save_as_mhtml(driver, link, output_path):
                saved += 1
    finally:
        driver.quit()  # Always close the browser, even after an error
    print(f"Saved {saved} of {len(all_links)} pages")

if __name__ == "__main__":
    main()
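
The main program derives each filename directly from the URL. On large sites that can produce very long names, and two different URLs can collapse into the same name once special characters are replaced. One possible refinement, sketched below, cleans the name, truncates it, and appends a short hash of the full URL. `url_to_filename` and its `max_length` parameter are hypothetical and not part of the program above.

```python
import hashlib
import re
from urllib.parse import urlparse


def url_to_filename(url, max_length=100):
    """Build a filesystem-safe, reasonably unique .mhtml filename from a URL."""
    parsed = urlparse(url)
    raw = parsed.path.strip("/") or "index"
    if parsed.query:
        raw += "_" + parsed.query
    # Replace every run of characters other than letters, digits, dot, dash, underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_") or "index"
    # A short hash of the full URL keeps names distinct after cleaning and truncation.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{safe[:max_length]}_{digest}.mhtml"


print(url_to_filename("https://example.com/docs/page?id=1&lang=en"))
print(url_to_filename("https://example.com/"))
```

The hash suffix makes names less pretty, so if you prefer readable names and your site has simple URLs, the plain replacement approach may be enough.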

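4.2 Performance Tips

Parallel processing: use multiple threads to speed up page capture
Resumable runs: record the URLs already processed to avoid repeating work
Smart waiting: adjust wait times dynamically based on network conditions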

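from concurrent.futures import ThreadPoolExecutor

def process_page(link):
    # Placeholder for the per-page logic. A WebDriver instance is not
    # thread-safe, so each worker thread should use its own driver.
    pass

# all_links is the set returned by get_all_links()
with ThreadPoolExecutor(max_workers=4) as executor:
    list(tqdm(executor.map(process_page, all_links), total=len(all_links)))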
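
The resumable-run tip from the list above can be as simple as a text file that records each finished URL. The following is a minimal sketch under that assumption. The `Checkpoint` class, `pending_urls`, and the `done_urls.txt` filename are all made up for illustration.

```python
import os


class Checkpoint:
    """Remembers finished URLs in a plain-text file, one URL per line."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.done = {line.strip() for line in f if line.strip()}

    def is_done(self, url):
        return url in self.done

    def mark_done(self, url):
        if url in self.done:
            return
        self.done.add(url)
        # Append immediately so progress survives a crash or Ctrl+C.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(url + "\n")


def pending_urls(all_links, checkpoint):
    """Return the URLs that still need saving, in a stable order."""
    return sorted(url for url in all_links if not checkpoint.is_done(url))


if __name__ == "__main__":
    checkpoint = Checkpoint("done_urls.txt")  # hypothetical checkpoint file
    links = {"https://example.com/", "https://example.com/docs"}
    for url in pending_urls(links, checkpoint):
        # In the real workflow, call mark_done only after the page was saved successfully.
        checkpoint.mark_done(url)
```

This class is not thread-safe as written, so in a multithreaded run it is simplest to call `mark_done` from the main thread as results come back.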

5. Troubleshooting in Practice

5.1 Common Errors and Fixes

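Error type	Likely cause	Fix
Incomplete page	Dynamic content not loaded	Increase the wait time or add an explicit wait
Save failure	Special characters in the path	Sanitize the filename
Timeout error	Network latency	Adjust the timeout setting or add a retry mechanism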
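
The retry mechanism mentioned in the last row of the table can be sketched as a small wrapper with exponential backoff. This is one possible shape, not the only one. `retry` and its parameters are hypothetical, and the `sleep` argument exists only so that the delay can be replaced in tests.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it returns a truthy value or attempts run out.

    Exceptions count as failed attempts. The delay doubles after each failure.
    If the final attempt raised, that exception is re-raised; otherwise False is returned.
    """
    last_error = None
    for attempt in range(attempts):
        last_error = None
        try:
            result = operation()
            if result:
                return result
        except Exception as e:
            last_error = e
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    if last_error is not None:
        raise last_error
    return False
```

With the function from section 2.2, a call might look like `retry(lambda: save_as_mhtml(driver, url, output_path))`, which retries when the save returns False.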

5.2 Logging and Debugging

I recommend adding detailed logging:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='website_saver.log'
)

def save_as_mhtml(driver, url, output_path):
    try:
        logging.info(f"Processing {url}")
        # ... original code ...
    except Exception as e:
        logging.error(f"Failed to save {url}: {str(e)}")
        raise

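In real projects, I've found that the most time-consuming part is usually waiting for dynamic content to load. By analyzing how a specific site loads, you can tailor a more precise wait strategy. For example, some single-page applications (SPAs) only count as fully loaded once a particular DOM element has appeared.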
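
When a site has no single reliable marker element, one alternative worth trying is to wait until the DOM stops changing. The sketch below polls the size of the serialized document and returns once it has stayed the same for several checks in a row. It assumes `driver` is a Selenium WebDriver (only `execute_script` is used); `wait_for_stable_dom` and all of its parameters are hypothetical names of mine.

```python
import time


def wait_for_stable_dom(driver, timeout=15.0, poll=0.5, stable_polls=3,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll the DOM size until it is unchanged for `stable_polls` consecutive checks.

    Returns True once the DOM looks stable, or False if `timeout` seconds pass first.
    `clock` and `sleep` are injectable so the function can be tested without waiting.
    """
    script = "return document.documentElement.outerHTML.length"
    deadline = clock() + timeout
    last_size, unchanged = None, 0
    while clock() < deadline:
        size = driver.execute_script(script)
        # Count consecutive polls that report the same size as the previous poll.
        unchanged = unchanged + 1 if size == last_size else 0
        if unchanged >= stable_polls:
            return True
        last_size = size
        sleep(poll)
    return False
```

This is a heuristic: a page whose content changes continuously (a ticking clock, for example) will never look stable, so the function returns False after the timeout and the caller can decide whether to capture the page anyway.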