Phi-3-mini-4k-instruct实现Python爬虫数据智能处理：自动化采集与清洗-洪萨配资

Phi-3-mini-4k-instruct实现Python爬虫数据智能处理：自动化采集与清洗

做爬虫的朋友应该都有过这样的经历：面对一个全新的网站，你得花上半天时间研究它的HTML结构，写一堆正则表达式或者XPath，好不容易把数据抓下来了，结果发现格式乱七八糟，还得再写一堆清洗代码。整个过程下来，技术含量不高，但重复劳动特别多。

最近我在尝试用Phi-3-mini-4k-instruct来优化这个流程，发现效果还挺不错的。这个只有38亿参数的小模型，在代码理解和生成方面表现相当出色，特别适合用来处理爬虫开发中的那些重复性工作。

1. 为什么选择Phi-3-mini-4k-instruct来做爬虫助手？

你可能觉得奇怪，为什么不用那些更大的模型？我刚开始也有这个疑问，但实际用下来发现，Phi-3-mini-4k-instruct有几个特别适合爬虫场景的优势。

首先它真的很轻量。2.2GB的模型大小，在我的笔记本上就能流畅运行，不需要什么高端显卡。这意味着你可以把它当作一个本地工具来用，不用担心API调用次数限制，也不用担心数据隐私问题。

更重要的是，它的训练数据里包含了大量Python代码。根据官方文档，Phi-3的训练数据主要基于Python，而且特别关注了像typing、math、random、collections、datetime、itertools这些常用包。这对于爬虫开发来说简直是量身定做——我们用的requests、BeautifulSoup、lxml、pandas这些库，它都相当熟悉。

我测试过几个场景：让它根据网站URL生成爬虫代码、分析网页结构自动提取数据、处理那些烦人的反爬机制，效果都超出了我的预期。虽然它偶尔会犯一些小错误，但整体思路是对的，稍微调整一下就能用。

2. 快速搭建本地开发环境

2.1 安装Ollama和Phi-3模型

如果你还没装Ollama，这个过程很简单。打开终端，一行命令搞定：

curl -fsSL https://ollama.com/install.sh | sh

安装完成后，拉取Phi-3-mini-4k-instruct模型：

ollama run phi3

第一次运行会下载模型，大概2.2GB，取决于你的网络速度。下载完成后，模型就常驻在本地了，随时可以调用。

2.2 配置Python开发环境

我建议创建一个独立的虚拟环境，这样依赖管理比较清晰：

python -m venv phi3-crawler-env source phi3-crawler-env/bin/activate # Linux/Mac # 或者 phi3-crawler-env\Scripts\activate # Windows

然后安装必要的Python包：

pip install requests beautifulsoup4 lxml pandas pip install ollama # 这是Ollama的Python客户端

2.3 测试模型连接

写个简单的测试脚本，确保一切正常：

import ollama def test_phi3(): """测试Phi-3模型是否能正常响应""" try: response = ollama.chat( model='phi3', messages=[{ 'role': 'user', 'content': '用Python写一个简单的HTTP请求示例' }] ) print("模型响应正常！") print("示例代码：") print(response['message']['content']) return True except Exception as e: print(f"连接失败：{e}") return False if __name__ == "__main__": test_phi3()

如果看到模型生成的Python代码，说明环境配置成功了。

3. 智能爬虫代码生成实战

3.1 根据网站URL自动生成爬虫框架

以前我们要分析一个网站，得手动打开开发者工具，一个个元素看过去。现在可以让Phi-3来帮我们做初步分析。

我写了一个辅助函数，专门用来生成爬虫框架：

import ollama import requests from bs4 import BeautifulSoup def generate_crawler_from_url(url, target_elements=None): """ 根据URL自动生成爬虫代码框架 Args: url: 目标网站URL target_elements: 可选，指定要抓取的元素类型，如['h1', 'p', 'a'] """ # 先获取网页内容，让模型分析结构 try: response = requests.get(url, timeout=10) response.raise_for_status() html_content = response.text[:2000] # 取前2000字符给模型分析 except Exception as e: return f"无法获取网页内容：{e}" # 构建提示词 prompt = f""" 请分析以下网页内容，并生成一个Python爬虫代码框架。 网页URL: {url} 网页内容片段： {html_content} 需要抓取的元素：{target_elements if target_elements else '所有主要内容'} 请生成完整的Python代码，包括： 1. 必要的import语句 2. 发送HTTP请求的部分 3. 解析HTML的代码 4. 数据提取逻辑 5. 错误处理 6. 数据保存（建议保存为CSV格式） 代码要简洁实用，适合新手理解。 """ # 调用Phi-3模型 response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) return response['message']['content'] # 使用示例 if __name__ == "__main__": url = "https://example.com" code = generate_crawler_from_url(url, target_elements=['h1', 'h2', 'p', 'a']) print("生成的爬虫代码：") print(code) # 可以把代码保存到文件 with open('generated_crawler.py', 'w', encoding='utf-8') as f: f.write(code)

这个函数做了几件事：先获取网页内容，然后让Phi-3分析结构，最后生成完整的爬虫代码。我测试了几个新闻网站和电商网站，生成的代码质量都还不错，基本可以直接用。

3.2 智能解析网页结构

有时候网页结构比较复杂，手动写选择器很麻烦。Phi-3可以帮我们分析出最佳的数据提取方案。

def analyze_page_structure(url): """分析网页结构，推荐数据提取方案""" # 获取完整页面 response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # 提取关键信息给模型分析 page_info = { 'title': soup.title.string if soup.title else '无标题', 'meta_description': soup.find('meta', attrs={'name': 'description'}), 'h1_count': len(soup.find_all('h1')), 'h2_count': len(soup.find_all('h2')), 'paragraph_count': len(soup.find_all('p')), 'link_count': len(soup.find_all('a')), 'table_count': len(soup.find_all('table')), } # 构建分析提示 prompt = f""" 分析以下网页的结构特征，给出数据提取建议： 网页信息： - 标题：{page_info['title']} - H1标题数量：{page_info['h1_count']} - H2标题数量：{page_info['h2_count']} - 段落数量：{page_info['paragraph_count']} - 链接数量：{page_info['link_count']} - 表格数量：{page_info['table_count']} 请回答： 1. 这个网页可能是什么类型的网站？（新闻、电商、博客等） 2. 最重要的数据可能在哪里？ 3. 推荐使用什么选择器来提取数据？（CSS选择器或XPath） 4. 需要注意哪些反爬机制？ 5. 给出3个最可能的数据提取方案 """ response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) return response['message']['content'] # 使用示例 analysis = analyze_page_structure("https://news.example.com") print("网页结构分析结果：") print(analysis)

这个分析功能特别有用，尤其是面对陌生网站时。Phi-3能根据网页的标签分布，判断出网站类型，并给出针对性的提取建议。

4. 自动化数据清洗与格式化

抓下来的数据往往很乱，清洗工作最耗时。我们可以让Phi-3来帮我们写清洗规则。

4.1 智能数据清洗代码生成

def generate_data_cleaner(raw_data_sample, desired_format): """ 根据原始数据样本和期望格式，生成数据清洗代码 Args: raw_data_sample: 原始数据样本（字符串或字典列表） desired_format: 期望的数据格式描述 """ prompt = f""" 请根据以下原始数据样本和期望格式，生成Python数据清洗代码。 原始数据样本： {str(raw_data_sample)[:1000]} # 限制长度 期望的数据格式： {desired_format} 请生成完整的清洗函数，包括： 1. 去除空白字符和特殊字符 2. 格式转换（如日期、数字） 3. 数据验证 4. 缺失值处理 5. 去重逻辑 函数应该能够处理批量数据，并返回清洗后的结果。 """ response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) return response['message']['content'] # 使用示例 raw_data = [ {"title": " Python编程 ", "price": "$99.99", "date": "2024-01-01"}, {"title": "机器学习入门 ", "price": "149.99元", "date": "2023/12/15"}, {"title": "数据分析实战", "price": "200", "date": ""} ] desired_format = """ 清洗后的数据应该符合以下格式： - title: 字符串，去除首尾空格，首字母大写 - price: 浮点数，去除货币符号，转换为数字 - date: 统一为YYYY-MM-DD格式，缺失值填充为"未知" """ cleaner_code = generate_data_cleaner(raw_data, desired_format) print("生成的数据清洗代码：") print(cleaner_code)

4.2 实际清洗函数示例

让我们看看Phi-3生成的清洗代码长什么样。我运行上面的示例后，得到了这样的代码：

import pandas as pd import re from datetime import datetime def clean_data(raw_data_list): """ 清洗原始数据，转换为统一格式 Args: raw_data_list: 原始数据字典列表 Returns: list: 清洗后的数据字典列表 """ cleaned_data = [] for item in raw_data_list: cleaned_item = {} # 清洗title title = item.get('title', '').strip() if title: # 去除多余空格，首字母大写 title = ' '.join(title.split()) cleaned_item['title'] = title.title() else: cleaned_item['title'] = '未知标题' # 清洗price price = item.get('price', '0') # 移除货币符号和中文单位 price_clean = re.sub(r'[^\d.]', '', price) try: cleaned_item['price'] = float(price_clean) if price_clean else 0.0 except ValueError: cleaned_item['price'] = 0.0 # 清洗date date_str = item.get('date', '').strip() if date_str: # 尝试不同格式解析 for fmt in ('%Y-%m-%d', '%Y/%m/%d', '%Y.%m.%d'): try: date_obj = datetime.strptime(date_str, fmt) cleaned_item['date'] = date_obj.strftime('%Y-%m-%d') break except ValueError: continue else: cleaned_item['date'] = '未知' else: cleaned_item['date'] = '未知' cleaned_data.append(cleaned_item) return cleaned_data # 使用示例 if __name__ == "__main__": raw_data = [ {"title": " Python编程 ", "price": "$99.99", "date": "2024-01-01"}, {"title": "机器学习入门 ", "price": "149.99元", "date": "2023/12/15"}, {"title": "数据分析实战", "price": "200", "date": ""} ] cleaned = clean_data(raw_data) for item in cleaned: print(item)

这个生成的代码质量相当不错，考虑了多种日期格式，处理了货币符号，还有错误处理。虽然可能还需要微调，但已经节省了大量时间。

5. 处理常见反爬机制

5.1 自动识别和应对反爬策略

反爬是爬虫开发中最头疼的问题。Phi-3可以帮助我们识别常见的反爬机制，并生成应对代码。

def generate_anti_anti_crawler_strategies(url, error_messages=None): """ 生成应对反爬机制的策略和代码 Args: url: 目标网站URL error_messages: 遇到的错误信息列表 """ prompt = f""" 针对网站 {url}，请提供完整的反爬虫应对方案。 已知问题：{error_messages if error_messages else '暂无具体错误信息'} 请提供以下内容： 1. 该网站可能使用的反爬机制分析 2. 对应的解决方案和Python代码 3. 请求头设置建议 4. 代理IP使用策略 5. 请求频率控制方案 6. 验证码处理建议（如果可能遇到） 代码要实用，可以直接集成到爬虫中。 """ response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) return response['message']['content'] # 使用示例 strategies = generate_anti_anti_crawler_strategies( url="https://example.com", error_messages=["HTTP 403 Forbidden", "请求频率过高被限制"] ) print("反爬应对策略：") print(strategies)

5.2 完整的反爬虫代码模板

根据Phi-3的建议，我们可以构建一个健壮的爬虫类：

import requests import time import random from fake_useragent import UserAgent class RobustCrawler: """具有反爬应对能力的爬虫类""" def __init__(self, use_proxy=False): self.session = requests.Session() self.ua = UserAgent() self.use_proxy = use_proxy self.request_count = 0 self.last_request_time = time.time() # 设置通用请求头 self.headers = { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'keep-alive', 'Upgrade-Insecure-Requests': '1', } def get_with_retry(self, url, max_retries=3, timeout=10): """带重试机制的GET请求""" for attempt in range(max_retries): try: # 随机延迟，避免请求过于频繁 delay = random.uniform(1, 3) elapsed = time.time() - self.last_request_time if elapsed < delay: time.sleep(delay - elapsed) # 更新请求头 self.headers['User-Agent'] = self.ua.random # 发送请求 response = self.session.get( url, headers=self.headers, timeout=timeout ) self.request_count += 1 self.last_request_time = time.time() # 检查响应状态 if response.status_code == 200: return response elif response.status_code == 403: print(f"第{attempt+1}次尝试：遇到403错误，更换User-Agent重试") continue elif response.status_code == 429: wait_time = 30 * (attempt + 1) # 指数退避 print(f"请求过于频繁，等待{wait_time}秒后重试") time.sleep(wait_time) continue else: response.raise_for_status() except requests.exceptions.RequestException as e: print(f"第{attempt+1}次请求失败：{e}") if attempt == max_retries - 1: raise time.sleep(2 ** attempt) # 指数退避 return None def rotate_proxy(self): """轮换代理IP（需要自己实现代理IP池）""" if self.use_proxy: # 这里需要接入你的代理IP服务 # 示例：self.session.proxies = {'http': 'http://proxy_ip:port'} pass # 使用示例 crawler = RobustCrawler() response = crawler.get_with_retry("https://example.com") if response: print("请求成功！") print(f"响应长度：{len(response.text)} 字符")

6. 完整项目实战：电商价格监控爬虫

让我们用一个实际项目来展示Phi-3的完整应用。假设我们要监控某个电商网站的商品价格变化。

6.1 项目需求分析

首先，我们让Phi-3帮我们分析需求并设计架构：

def design_crawler_architecture(requirements): """根据需求设计爬虫架构""" prompt = f""" 设计一个电商价格监控爬虫系统，需求如下： {requirements} 请提供： 1. 系统架构设计（组件图） 2. 数据库设计（如果需要） 3. 核心模块划分 4. 技术选型建议 5. 部署方案 6. 监控和报警机制 用清晰的中文描述，适合中级开发者理解。 """ response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) return response['message']['content'] # 定义需求 requirements = """ 1. 监控10个电商网站的100种商品价格 2. 每30分钟抓取一次价格 3. 检测价格变化，降价时发送通知 4. 存储历史价格数据 5. 支持分布式部署 6. 有Web界面查看监控状态 """ architecture = design_crawler_architecture(requirements) print("系统架构设计：") print(architecture)

6.2 核心代码实现

基于Phi-3的建议，我们实现核心的监控爬虫：

import requests import json import time import sqlite3 from datetime import datetime from bs4 import BeautifulSoup import schedule from typing import Dict, List, Optional class PriceMonitor: """电商价格监控爬虫""" def __init__(self, db_path='prices.db'): self.db_path = db_path self.init_database() self.products = self.load_products() def init_database(self): """初始化数据库""" conn = sqlite3.connect(self.db_path) cursor = conn.cursor() # 创建商品表 cursor.execute(''' CREATE TABLE IF NOT EXISTS products ( id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL, url TEXT NOT NULL, website TEXT NOT NULL, selector TEXT, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ) ''') # 创建价格历史表 cursor.execute(''' CREATE TABLE IF NOT EXISTS price_history ( id INTEGER PRIMARY KEY AUTOINCREMENT, product_id INTEGER, price REAL NOT NULL, currency TEXT DEFAULT 'CNY', timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP, FOREIGN KEY (product_id) REFERENCES products (id) ) ''') # 创建价格变化通知表 cursor.execute(''' CREATE TABLE IF NOT EXISTS price_alerts ( id INTEGER PRIMARY KEY AUTOINCREMENT, product_id INTEGER, old_price REAL, new_price REAL, change_percent REAL, timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP, notified BOOLEAN DEFAULT 0, FOREIGN KEY (product_id) REFERENCES products (id) ) ''') conn.commit() conn.close() def load_products(self) -> List[Dict]: """从数据库加载监控商品""" conn = sqlite3.connect(self.db_path) cursor = conn.cursor() cursor.execute('SELECT * FROM products') rows = cursor.fetchall() conn.close() products = [] for row in rows: products.append({ 'id': row[0], 'name': row[1], 'url': row[2], 'website': row[3], 'selector': row[4] }) return products def extract_price(self, html: str, selector: str) -> Optional[float]: """从HTML中提取价格""" # 如果提供了CSS选择器，使用BeautifulSoup提取 if selector: soup = BeautifulSoup(html, 'html.parser') element = soup.select_one(selector) if element: text = element.get_text().strip() else: return None else: # 如果没有选择器，尝试自动查找价格 soup = BeautifulSoup(html, 'html.parser') text = soup.get_text() # 使用Phi-3辅助分析价格模式 prompt = f""" 从以下文本中提取商品价格（数字）： 文本内容：{text[:500]} 请只返回价格数字，不要其他内容。 如果找不到价格，返回"未找到"。 示例： 输入："特价￥199.99元，原价299" 输出：199.99 """ try: response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) price_text = response['message']['content'].strip() # 清理提取结果 import re match = re.search(r'(\d+\.?\d*)', price_text) if match: return float(match.group(1)) except: pass return None def fetch_product_price(self, product: Dict) -> Optional[float]: """抓取单个商品价格""" try: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' } response = requests.get( product['url'], headers=headers, timeout=10 ) if response.status_code == 200: price = self.extract_price(response.text, product['selector']) return price else: print(f"请求失败：{response.status_code}") return None except Exception as e: print(f"抓取商品 {product['name']} 失败：{e}") return None def monitor_all_products(self): """监控所有商品""" print(f"{datetime.now()} 开始监控 {len(self.products)} 个商品...") for product in self.products: current_price = self.fetch_product_price(product) if current_price is not None: # 保存价格记录 self.save_price_record(product['id'], current_price) # 检查价格变化 self.check_price_change(product['id'], current_price) # 避免请求过于频繁 time.sleep(random.uniform(1, 2)) print(f"{datetime.now()} 监控完成") def save_price_record(self, product_id: int, price: float): """保存价格记录到数据库""" conn = sqlite3.connect(self.db_path) cursor = conn.cursor() cursor.execute( 'INSERT INTO price_history (product_id, price) VALUES (?, ?)', (product_id, price) ) conn.commit() conn.close() def check_price_change(self, product_id: int, current_price: float): """检查价格变化，触发通知""" conn = sqlite3.connect(self.db_path) cursor = conn.cursor() # 获取上一次价格 cursor.execute(''' SELECT price FROM price_history WHERE product_id = ? ORDER BY timestamp DESC LIMIT 1 OFFSET 1 ''', (product_id,)) result = cursor.fetchone() if result: previous_price = result[0] # 计算变化百分比 if previous_price > 0: change_percent = ((current_price - previous_price) / previous_price) * 100 # 如果降价超过5%，记录通知 if change_percent < -5: cursor.execute(''' INSERT INTO price_alerts (product_id, old_price, new_price, change_percent) VALUES (?, ?, ?, ?) ''', (product_id, previous_price, current_price, change_percent)) conn.commit() print(f"价格下降警报：{product_id}，从{previous_price}降到{current_price}，降幅{change_percent:.1f}%") conn.close() def run_scheduler(self): """启动定时监控""" schedule.every(30).minutes.do(self.monitor_all_products) print("价格监控系统已启动，每30分钟运行一次") print("按Ctrl+C退出") # 立即运行一次 self.monitor_all_products() # 保持运行 while True: schedule.run_pending() time.sleep(1) # 使用示例 if __name__ == "__main__": monitor = PriceMonitor() # 添加示例商品（实际使用时应该从数据库管理） conn = sqlite3.connect('prices.db') cursor = conn.cursor() # 确保有测试数据 cursor.execute(''' INSERT OR IGNORE INTO products (name, url, website, selector) VALUES (?, ?, ?, ?) ''', ( '示例商品', 'https://example.com/product', 'example', '.price-selector' )) conn.commit() conn.close() # 启动监控 try: monitor.run_scheduler() except KeyboardInterrupt: print("监控已停止")

7. 调试技巧与最佳实践

7.1 如何让Phi-3生成更好的代码

经过一段时间的使用，我总结了一些让Phi-3生成更优质代码的技巧：

提供足够的上下文：不要只给一个简单的指令。告诉模型你要做什么、为什么做、期望的输出是什么。

# 不好的提示 prompt = "写一个爬虫" # 好的提示 prompt = """ 我需要一个爬虫来抓取新闻网站的文章。 要求： 1. 处理分页，自动翻页直到没有新内容 2. 提取标题、发布时间、作者、正文 3. 避免被反爬，需要随机延迟和User-Agent轮换 4. 数据保存为JSON格式 5. 包含完整的错误处理 请用Python实现，使用requests和BeautifulSoup。 """

分步骤请求：复杂的任务可以拆分成多个步骤，让模型一步步完成。

提供示例：如果你有特定的代码风格要求，提供一个示例让模型学习。

7.2 常见问题与解决方案

问题1：模型生成的代码有语法错误

解决方案：让模型先解释思路，再生成代码。或者分模块生成，逐个测试。

问题2：选择器不准确

解决方案：提供实际的HTML片段，让模型基于具体内容生成选择器。

问题3：处理动态加载内容

解决方案：明确告诉模型网站使用了JavaScript动态加载，需要模拟浏览器或使用Selenium。

def generate_selenium_crawler(url): """生成Selenium爬虫代码""" prompt = f""" 网站 {url} 使用了JavaScript动态加载内容。 请生成使用Selenium的爬虫代码。 要求： 1. 使用Chrome浏览器驱动 2. 等待动态内容加载完成 3. 处理可能的弹窗和登录 4. 滚动页面加载更多内容 5. 最后关闭浏览器 代码要完整，可以直接运行。 """ response = ollama.chat( model='phi3', messages=[{'role': 'user', 'content': prompt}] ) return response['message']['content']

8. 总结

用Phi-3-mini-4k-instruct来辅助Python爬虫开发，确实能显著提升效率。它最擅长的不是替代你写整个项目，而是处理那些重复性高、模式固定的任务。

比如分析网页结构、生成数据清洗规则、应对常见的反爬机制，这些工作Phi-3都能做得不错。虽然生成的代码偶尔需要手动调整，但相比从头开始写，节省的时间是实实在在的。

我特别喜欢它的本地部署特性。爬虫开发经常需要反复调试，如果每次都要调用云端API，不仅慢，还可能遇到限流。本地模型就没有这些问题，想怎么调就怎么调。

当然，它也有局限性。对于特别复杂的网站结构，或者需要高度定制化的业务逻辑，还是需要人工介入。但作为开发助手，Phi-3已经足够优秀了。

如果你经常做爬虫项目，我强烈建议试试这个组合。从环境搭建到实际应用，整个过程都很顺畅。最重要的是，它能让你把精力集中在更有价值的部分，而不是浪费在重复劳动上。

获取更多AI镜像
想探索更多AI镜像和应用场景？访问 CSDN星图镜像广场，提供丰富的预置镜像，覆盖大模型推理、图像生成、视频生成、模型微调等多个领域，支持一键部署。

Phi-3-mini-4k-instruct实现Python爬虫数据智能处理：自动化采集与清洗