RexUniNLU Step-by-Step: Log Instrumentation, Prometheus Monitoring, and Building an NLU Service Performance Dashboard

张开发
2026/4/12 9:57:11 · 15 min read


## 1. Why Monitor an NLU Service?

Once RexUniNLU is deployed to production, a few practical problems surface:

- Users report that response times fluctuate, but you cannot tell where the latency comes from.
- When recognition errors occur, it is hard to pinpoint which stage failed.
- You do not know how much traffic the service actually carries, or whether it can stay stable at peak load.

Running an NLU service without monitoring is like driving at night with the headlights off: you don't know your speed, how much fuel is left, or whether there is danger ahead. This tutorial walks through building a complete monitoring stack for RexUniNLU so the service's runtime state is visible at a glance.

## 2. Environment Setup and Dependencies

Before starting, make sure a basic RexUniNLU deployment is already running. We need a few extra monitoring dependencies:

```bash
# Install the Python dependencies for monitoring
pip install prometheus-client loguru

# Install Prometheus (choose one)
# Option 1: Docker (recommended)
docker run -d -p 9090:9090 --name prometheus prom/prometheus
# Option 2: local install
# Download the matching build from https://prometheus.io/download/

# Create dedicated directories for monitoring config and logs
mkdir -p monitoring/config
mkdir -p monitoring/logs
```

## 3. Adding Log Instrumentation to RexUniNLU

Logs are the foundation of monitoring: good logs let us locate problems quickly. We use loguru, a powerful logging library, in place of Python's built-in logging module. Create `monitoring/logger_setup.py`:

```python
from loguru import logger
import json
import time


def setup_logger():
    """Configure the logging system."""
    log_path = "monitoring/logs/nlu_service.log"

    # Remove loguru's default handler
    logger.remove()

    # File log: rotate daily at midnight, keep 7 days
    logger.add(
        log_path,
        rotation="00:00",
        retention="7 days",
        encoding="utf-8",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
        level="INFO",
    )

    # Error log stored separately, kept for 30 days
    logger.add(
        "monitoring/logs/error.log",
        rotation="00:00",
        retention="30 days",
        encoding="utf-8",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
        level="ERROR",
    )
    return logger


# Initialize logging at import time
nlu_logger = setup_logger()


def log_request(request_id, text, labels, result, processing_time):
    """Record a structured entry for one request."""
    log_data = {
        "request_id": request_id,
        "text": text,
        "labels": labels,
        "result": result,
        "processing_time": processing_time,
        "timestamp": time.time(),
    }
    nlu_logger.info(f"NLU_REQUEST|{json.dumps(log_data, ensure_ascii=False)}")


def log_error(request_id, error_type, error_message, text=None):
    """Record a structured error entry."""
    error_data = {
        "request_id": request_id,
        "error_type": error_type,
        "error_message": error_message,
        "text": text,
        "timestamp": time.time(),
    }
    nlu_logger.error(f"NLU_ERROR|{json.dumps(error_data, ensure_ascii=False)}")
```
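Because every `NLU_REQUEST` line carries a JSON payload, the log file can also be analyzed offline without extra tooling. As a small sketch (the helper names below are my own, not part of the tutorial), this parses those lines and computes a nearest-rank P95 of `processing_time`:

```python
import json


def parse_request_logs(lines):
    """Extract the JSON payloads from lines shaped like
    '2026-04-12 09:57:11 | INFO | NLU_REQUEST|{...json...}'."""
    marker = "NLU_REQUEST|"
    records = []
    for line in lines:
        idx = line.find(marker)
        if idx != -1:
            records.append(json.loads(line[idx + len(marker):]))
    return records


def p95_latency(records):
    """Nearest-rank 95th percentile of processing_time; 0.0 if empty."""
    times = sorted(r["processing_time"] for r in records)
    if not times:
        return 0.0
    k = max(0, int(round(0.95 * len(times))) - 1)
    return times[k]
```

For example, `p95_latency(parse_request_logs(open("monitoring/logs/nlu_service.log", encoding="utf-8")))` gives a quick latency estimate from yesterday's log.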
## 4. Integrating Prometheus Metrics

Prometheus is a popular monitoring system; we need RexUniNLU to expose metrics it can scrape. Create `monitoring/prometheus_metrics.py`:

```python
from prometheus_client import Counter, Gauge, Histogram, generate_latest
import functools
import time

# Metric definitions
NLU_REQUESTS_TOTAL = Counter(
    "nlu_requests_total", "Total NLU requests", ["method", "endpoint"]
)
NLU_REQUEST_DURATION = Histogram(
    "nlu_request_duration_seconds",
    "NLU request duration in seconds",
    ["method", "endpoint"],
)
NLU_REQUEST_SIZE = Gauge("nlu_request_size_bytes", "Size of NLU request in bytes")
NLU_RESPONSE_SIZE = Gauge("nlu_response_size_bytes", "Size of NLU response in bytes")
NLU_SUCCESS_RATE = Gauge("nlu_success_rate", "NLU request success rate")
NLU_MODEL_LOAD_TIME = Gauge(
    "nlu_model_load_time_seconds", "Time taken to load NLU model"
)


def track_request_time(method, endpoint):
    """Decorator that records request latency and counts.

    The wrapper is async because the FastAPI endpoints it decorates are
    coroutines; a sync wrapper would only time coroutine creation.
    """
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = await func(*args, **kwargs)
            except Exception:
                # Record latency for failed requests too, then re-raise
                NLU_REQUEST_DURATION.labels(
                    method=method, endpoint=endpoint
                ).observe(time.time() - start_time)
                raise
            NLU_REQUEST_DURATION.labels(
                method=method, endpoint=endpoint
            ).observe(time.time() - start_time)
            NLU_REQUESTS_TOTAL.labels(method=method, endpoint=endpoint).inc()
            return result
        return wrapper
    return decorator
```
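One caveat: the server updates `nlu_success_rate` from cumulative counters, so as the lifetime total grows the gauge converges and stops reacting to recent failures. A sliding-window helper keeps it responsive; this is a sketch under my own naming (the class and default window size are not part of RexUniNLU):

```python
from collections import deque


class SlidingSuccessRate:
    """Success rate over the most recent `window` requests.

    Hypothetical helper, not part of the tutorial's code."""

    def __init__(self, window=100):
        # deque with maxlen automatically drops the oldest outcome
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        self.outcomes.append(1 if ok else 0)

    def rate(self):
        # Report 1.0 before any traffic so the gauge starts "healthy"
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)
```

If you adopt this, call `record()` wherever the middleware updates its statistics and feed `rate()` into `NLU_SUCCESS_RATE.set()`.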
## 5. Wiring Monitoring into the RexUniNLU Server

Now modify the original `server.py` to integrate the logging and metrics. Create `server_with_monitoring.py`:

```python
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import Response
import uvicorn
import time
import uuid

# Monitoring components
from monitoring.logger_setup import log_request, log_error, nlu_logger
from monitoring.prometheus_metrics import (
    generate_latest,
    track_request_time,
    NLU_REQUEST_SIZE,
    NLU_RESPONSE_SIZE,
    NLU_SUCCESS_RATE,
    NLU_MODEL_LOAD_TIME,
)

# The original RexUniNLU pipeline
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

app = FastAPI(title="RexUniNLU with Monitoring")

# Globals holding the model and simple request statistics
nlu_pipeline = None
request_stats = {"total": 0, "success": 0}


@app.on_event("startup")
async def startup_event():
    """Load the model when the service starts."""
    global nlu_pipeline
    try:
        nlu_logger.info("Loading RexUniNLU model...")
        start_time = time.time()
        nlu_pipeline = pipeline(
            task=Tasks.siamese_uie,
            model="damo/nlp_siamese-uie_chinese-base",
        )
        load_time = time.time() - start_time
        NLU_MODEL_LOAD_TIME.set(load_time)
        nlu_logger.info(f"Model loaded in {load_time:.2f}s")
    except Exception as e:
        nlu_logger.error(f"Model loading failed: {e}")
        raise


@app.middleware("http")
async def monitor_requests(request: Request, call_next):
    """Monitoring middleware: request/response sizes and success rate."""
    request_id = str(uuid.uuid4())
    # Use Content-Length instead of reading the body here: consuming the
    # body stream inside middleware can break downstream request parsing.
    NLU_REQUEST_SIZE.set(int(request.headers.get("content-length", 0)))

    try:
        response = await call_next(request)
    except Exception as e:
        request_stats["total"] += 1
        NLU_SUCCESS_RATE.set(request_stats["success"] / request_stats["total"])
        log_error(request_id, "middleware_error", str(e))
        raise HTTPException(status_code=500, detail="Internal server error")

    # Drain the response body so its size can be recorded
    response_body = b""
    async for chunk in response.body_iterator:
        response_body += chunk
    NLU_RESPONSE_SIZE.set(len(response_body))

    # Update the success-rate gauge (count every request, not only successes)
    request_stats["total"] += 1
    if response.status_code < 400:
        request_stats["success"] += 1
    NLU_SUCCESS_RATE.set(request_stats["success"] / request_stats["total"])

    return Response(
        content=response_body,
        status_code=response.status_code,
        headers=dict(response.headers),
    )


@app.post("/nlu")
@track_request_time("POST", "/nlu")
async def nlu_endpoint(request: Request):
    """NLU inference endpoint."""
    request_id = str(uuid.uuid4())
    text = ""
    try:
        data = await request.json()
        text = data.get("text", "")
        labels = data.get("labels", [])
        if not text or not labels:
            raise HTTPException(status_code=400, detail="Missing text or labels")

        start_time = time.time()
        result = nlu_pipeline(input=text, schema=labels)
        processing_time = time.time() - start_time

        log_request(request_id, text, labels, result, processing_time)
        return {
            "request_id": request_id,
            "result": result,
            "processing_time": processing_time,
            "status": "success",
        }
    except HTTPException:
        raise
    except Exception as e:
        log_error(request_id, "nlu_processing_error", str(e), text)
        raise HTTPException(status_code=500, detail=f"Processing failed: {e}")


@app.get("/metrics")
async def metrics():
    """Prometheus scrape endpoint."""
    return Response(generate_latest(), media_type="text/plain")


@app.get("/health")
async def health_check():
    """Health-check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": nlu_pipeline is not None,
        "request_stats": request_stats,
    }


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

## 6. Configuring Prometheus

Create the Prometheus configuration file `monitoring/config/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s      # default scrape every 15 seconds
  evaluation_interval: 15s

scrape_configs:
  - job_name: rexuninlu
    static_configs:
      - targets: ["localhost:8000"]   # NLU service address
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
```

Start Prometheus:

```bash
# With Docker
docker run -d \
  -p 9090:9090 \
  -v $(pwd)/monitoring/config/prometheus.yml:/etc/prometheus/prometheus.yml \
  --name prometheus \
  prom/prometheus

# Or with a local install
./prometheus --config.file=monitoring/config/prometheus.yml
```

Note: when Prometheus runs inside Docker, `localhost:8000` refers to the container itself, so replace the target with an address the container can reach (on Docker Desktop, `host.docker.internal:8000`).
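The article's closing section suggests adding alerting rules. As a sketch (the file name and thresholds below are my own, not from the tutorial), a minimal rules file could look like this, loaded by adding `rule_files: ["alert_rules.yml"]` to `prometheus.yml`:

```yaml
# monitoring/config/alert_rules.yml (hypothetical thresholds)
groups:
  - name: rexuninlu-alerts
    rules:
      - alert: NLULowSuccessRate
        expr: nlu_success_rate < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NLU success rate below 95% for 5 minutes"

      - alert: NLUHighLatency
        expr: histogram_quantile(0.95, rate(nlu_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NLU P95 latency above 1s for 5 minutes"
```

Delivering these alerts (email, chat, etc.) additionally requires configuring an Alertmanager instance, which is outside the scope of this tutorial.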
## 7. Building the Dashboard with Grafana

Install Grafana:

```bash
# With Docker
docker run -d -p 3000:3000 --name grafana grafana/grafana

# Or install locally
# See: https://grafana.com/docs/grafana/latest/installation/
```

Configure the Grafana data source:

1. Open http://localhost:3000 (default credentials: admin/admin).
2. Add a Prometheus data source pointing at http://localhost:9090.
3. Import the NLU monitoring dashboard.

Create `monitoring/config/grafana-dashboard.json` (a simplified sketch of the panels; a full Grafana dashboard export contains more fields):

```json
{
  "title": "RexUniNLU Performance",
  "panels": [
    {
      "title": "Request throughput",
      "type": "graph",
      "targets": [{
        "expr": "rate(nlu_requests_total[1m])",
        "legendFormat": "request rate"
      }]
    },
    {
      "title": "Latency distribution",
      "type": "graph",
      "targets": [{
        "expr": "histogram_quantile(0.95, rate(nlu_request_duration_seconds_bucket[5m]))",
        "legendFormat": "P95 latency"
      }]
    },
    {
      "title": "Success rate",
      "type": "stat",
      "targets": [{
        "expr": "nlu_success_rate * 100",
        "legendFormat": "success rate"
      }]
    }
  ]
}
```

## 8. Running and Testing

Start the instrumented NLU service:

```bash
python server_with_monitoring.py
```

Verify that it works:

```bash
# Health check
curl http://localhost:8000/health

# NLU request
curl -X POST http://localhost:8000/nlu \
  -H "Content-Type: application/json" \
  -d '{"text": "帮我定一张明天去上海的机票", "labels": ["出发地", "目的地", "时间", "订票意图"]}'

# Metrics
curl http://localhost:8000/metrics
```

## 9. Verifying the Monitoring Stack

Open each interface and confirm everything works:

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Log files: `monitoring/logs/nlu_service.log`

You should see:

- Prometheus scraping all of the NLU service's metrics
- Grafana charts for request volume, latency, and success rate
- Detailed request and processing records in the log files

## 10. Summary

This tutorial built a complete monitoring stack for RexUniNLU.

Key takeaways:

- Structured logging with loguru
- Integrating Prometheus metrics into the service
- Visualizing monitoring data with Grafana
- An end-to-end pipeline from log instrumentation to dashboards

Practical value:

- Real-time visibility into service health
- Faster discovery of performance bottlenecks
- Quick localization and troubleshooting when problems occur
- Data-driven capacity planning and service optimization
- Better reliability and user experience

Next steps:

- Add alerting rules that notify you automatically when the success rate drops or latency grows
- Collect more metrics, such as memory usage and GPU utilization
- Build log analytics to assess recognition quality
- Schedule automated tests that regularly verify service health

Your RexUniNLU service has gone from an opaque black box to a fully observable white box, ready to serve in production.

Want more AI images and use cases? Visit the CSDN 星图镜像广场 (StarMap image plaza), which offers a rich set of prebuilt images covering LLM inference, image generation, video generation, and model fine-tuning, with one-click deployment.
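The steps above start Prometheus and Grafana by hand. As an optional convenience (a sketch, not part of the original tutorial), a docker-compose file can bring both up together with the config mounted:

```yaml
# monitoring/docker-compose.yml (hypothetical; adjust paths for your layout)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```

Because Prometheus then runs in a container, the `localhost:8000` target in `prometheus.yml` must be replaced with an address the container can reach.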
