RexUniNLU Step-by-Step: Log Instrumentation, Prometheus Monitoring, and Building an NLU Service Performance Dashboard

张开发
2026/4/12 9:57:11 · 15 min read


## 1. Why Monitor an NLU Service?

Once RexUniNLU is deployed to production, a few practical problems surface:

- Users report that response times fluctuate, but you cannot tell where the latency comes from.
- When recognition errors occur, it is hard to pinpoint which stage failed.
- You do not know how much traffic the service actually carries, or whether it can stay stable at peak load.

Running an NLU service without monitoring is like driving at night with the headlights off: you don't know your speed, how much fuel is left, or whether there is danger ahead. This tutorial walks through building a complete monitoring stack for RexUniNLU so the service's runtime state is visible at a glance.

## 2. Environment Setup and Dependencies

Before starting, make sure a basic RexUniNLU deployment is already running. We need a few extra monitoring dependencies:

```bash
# Install the Python dependencies for monitoring
pip install prometheus-client loguru

# Install Prometheus (choose one)
# Option 1: Docker (recommended)
docker run -d -p 9090:9090 --name prometheus prom/prometheus
# Option 2: local install
# Download the matching build from https://prometheus.io/download/

# Create dedicated directories for monitoring config and logs
mkdir -p monitoring/config
mkdir -p monitoring/logs
```

## 3. Adding Log Instrumentation to RexUniNLU

Logs are the foundation of monitoring: good logs let us locate problems quickly. We use loguru, a powerful logging library, in place of Python's built-in logging module. Create `monitoring/logger_setup.py`:

```python
from loguru import logger
import json
import time


def setup_logger():
    """Configure the logging system."""
    log_path = "monitoring/logs/nlu_service.log"

    # Remove loguru's default handler
    logger.remove()

    # File log: rotate daily at midnight, keep 7 days
    logger.add(
        log_path,
        rotation="00:00",
        retention="7 days",
        encoding="utf-8",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
        level="INFO",
    )

    # Error log stored separately, kept for 30 days
    logger.add(
        "monitoring/logs/error.log",
        rotation="00:00",
        retention="30 days",
        encoding="utf-8",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}",
        level="ERROR",
    )
    return logger


# Initialize logging at import time
nlu_logger = setup_logger()


def log_request(request_id, text, labels, result, processing_time):
    """Record a structured entry for one request."""
    log_data = {
        "request_id": request_id,
        "text": text,
        "labels": labels,
        "result": result,
        "processing_time": processing_time,
        "timestamp": time.time(),
    }
    nlu_logger.info(f"NLU_REQUEST|{json.dumps(log_data, ensure_ascii=False)}")


def log_error(request_id, error_type, error_message, text=None):
    """Record a structured error entry."""
    error_data = {
        "request_id": request_id,
        "error_type": error_type,
        "error_message": error_message,
        "text": text,
        "timestamp": time.time(),
    }
    nlu_logger.error(f"NLU_ERROR|{json.dumps(error_data, ensure_ascii=False)}")
```
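Because every `NLU_REQUEST` line carries a JSON payload, the log file can also be analyzed offline without extra tooling. As a small sketch (the helper names below are my own, not part of the tutorial), this parses those lines and computes a nearest-rank P95 of `processing_time`:

```python
import json


def parse_request_logs(lines):
    """Extract the JSON payloads from lines shaped like
    '2026-04-12 09:57:11 | INFO | NLU_REQUEST|{...json...}'."""
    marker = "NLU_REQUEST|"
    records = []
    for line in lines:
        idx = line.find(marker)
        if idx != -1:
            records.append(json.loads(line[idx + len(marker):]))
    return records


def p95_latency(records):
    """Nearest-rank 95th percentile of processing_time; 0.0 if empty."""
    times = sorted(r["processing_time"] for r in records)
    if not times:
        return 0.0
    k = max(0, int(round(0.95 * len(times))) - 1)
    return times[k]
```

For example, `p95_latency(parse_request_logs(open("monitoring/logs/nlu_service.log", encoding="utf-8")))` gives a quick latency estimate from yesterday's log.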
## 4. Integrating Prometheus Metrics

Prometheus is a popular monitoring system; we need RexUniNLU to expose metrics it can scrape. Create `monitoring/prometheus_metrics.py`:

```python
from prometheus_client import Counter, Gauge, Histogram, generate_latest
import functools
import time

# Metric definitions
NLU_REQUESTS_TOTAL = Counter(
    "nlu_requests_total", "Total NLU requests", ["method", "endpoint"]
)
NLU_REQUEST_DURATION = Histogram(
    "nlu_request_duration_seconds",
    "NLU request duration in seconds",
    ["method", "endpoint"],
)
NLU_REQUEST_SIZE = Gauge("nlu_request_size_bytes", "Size of NLU request in bytes")
NLU_RESPONSE_SIZE = Gauge("nlu_response_size_bytes", "Size of NLU response in bytes")
NLU_SUCCESS_RATE = Gauge("nlu_success_rate", "NLU request success rate")
NLU_MODEL_LOAD_TIME = Gauge(
    "nlu_model_load_time_seconds", "Time taken to load NLU model"
)


def track_request_time(method, endpoint):
    """Decorator that records request latency and counts.

    The wrapper is async because the FastAPI endpoints it decorates are
    coroutines; a sync wrapper would only time coroutine creation.
    """
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.time()
            try:
                result = await func(*args, **kwargs)
            except Exception:
                # Record latency for failed requests too, then re-raise
                NLU_REQUEST_DURATION.labels(
                    method=method, endpoint=endpoint
                ).observe(time.time() - start_time)
                raise
            NLU_REQUEST_DURATION.labels(
                method=method, endpoint=endpoint
            ).observe(time.time() - start_time)
            NLU_REQUESTS_TOTAL.labels(method=method, endpoint=endpoint).inc()
            return result
        return wrapper
    return decorator
```
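One caveat: the server updates `nlu_success_rate` from cumulative counters, so as the lifetime total grows the gauge converges and stops reacting to recent failures. A sliding-window helper keeps it responsive; this is a sketch under my own naming (the class and default window size are not part of RexUniNLU):

```python
from collections import deque


class SlidingSuccessRate:
    """Success rate over the most recent `window` requests.

    Hypothetical helper, not part of the tutorial's code."""

    def __init__(self, window=100):
        # deque with maxlen automatically drops the oldest outcome
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        self.outcomes.append(1 if ok else 0)

    def rate(self):
        # Report 1.0 before any traffic so the gauge starts "healthy"
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)
```

If you adopt this, call `record()` wherever the middleware updates its statistics and feed `rate()` into `NLU_SUCCESS_RATE.set()`.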
## 5. Wiring Monitoring into the RexUniNLU Server

Now modify the original `server.py` to integrate the logging and metrics. Create `server_with_monitoring.py`:

```python
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import Response
import uvicorn
import time
import uuid

# Monitoring components
from monitoring.logger_setup import log_request, log_error, nlu_logger
from monitoring.prometheus_metrics import (
    generate_latest,
    track_request_time,
    NLU_REQUEST_SIZE,
    NLU_RESPONSE_SIZE,
    NLU_SUCCESS_RATE,
    NLU_MODEL_LOAD_TIME,
)

# The original RexUniNLU pipeline
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

app = FastAPI(title="RexUniNLU with Monitoring")

# Globals holding the model and simple request statistics
nlu_pipeline = None
request_stats = {"total": 0, "success": 0}


@app.on_event("startup")
async def startup_event():
    """Load the model when the service starts."""
    global nlu_pipeline
    try:
        nlu_logger.info("Loading RexUniNLU model...")
        start_time = time.time()
        nlu_pipeline = pipeline(
            task=Tasks.siamese_uie,
            model="damo/nlp_siamese-uie_chinese-base",
        )
        load_time = time.time() - start_time
        NLU_MODEL_LOAD_TIME.set(load_time)
        nlu_logger.info(f"Model loaded in {load_time:.2f}s")
    except Exception as e:
        nlu_logger.error(f"Model loading failed: {e}")
        raise


@app.middleware("http")
async def monitor_requests(request: Request, call_next):
    """Monitoring middleware: request/response sizes and success rate."""
    request_id = str(uuid.uuid4())
    # Use Content-Length instead of reading the body here: consuming the
    # body stream inside middleware can break downstream request parsing.
    NLU_REQUEST_SIZE.set(int(request.headers.get("content-length", 0)))

    try:
        response = await call_next(request)
    except Exception as e:
        request_stats["total"] += 1
        NLU_SUCCESS_RATE.set(request_stats["success"] / request_stats["total"])
        log_error(request_id, "middleware_error", str(e))
        raise HTTPException(status_code=500, detail="Internal server error")

    # Drain the response body so its size can be recorded
    response_body = b""
    async for chunk in response.body_iterator:
        response_body += chunk
    NLU_RESPONSE_SIZE.set(len(response_body))

    # Update the success-rate gauge (count every request, not only successes)
    request_stats["total"] += 1
    if response.status_code < 400:
        request_stats["success"] += 1
    NLU_SUCCESS_RATE.set(request_stats["success"] / request_stats["total"])

    return Response(
        content=response_body,
        status_code=response.status_code,
        headers=dict(response.headers),
    )


@app.post("/nlu")
@track_request_time("POST", "/nlu")
async def nlu_endpoint(request: Request):
    """NLU inference endpoint."""
    request_id = str(uuid.uuid4())
    text = ""
    try:
        data = await request.json()
        text = data.get("text", "")
        labels = data.get("labels", [])
        if not text or not labels:
            raise HTTPException(status_code=400, detail="Missing text or labels")

        start_time = time.time()
        result = nlu_pipeline(input=text, schema=labels)
        processing_time = time.time() - start_time

        log_request(request_id, text, labels, result, processing_time)
        return {
            "request_id": request_id,
            "result": result,
            "processing_time": processing_time,
            "status": "success",
        }
    except HTTPException:
        raise
    except Exception as e:
        log_error(request_id, "nlu_processing_error", str(e), text)
        raise HTTPException(status_code=500, detail=f"Processing failed: {e}")


@app.get("/metrics")
async def metrics():
    """Prometheus scrape endpoint."""
    return Response(generate_latest(), media_type="text/plain")


@app.get("/health")
async def health_check():
    """Health-check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": nlu_pipeline is not None,
        "request_stats": request_stats,
    }


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

## 6. Configuring Prometheus

Create the Prometheus configuration file `monitoring/config/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s      # default scrape every 15 seconds
  evaluation_interval: 15s

scrape_configs:
  - job_name: rexuninlu
    static_configs:
      - targets: ["localhost:8000"]   # NLU service address
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
```

Start Prometheus:

```bash
# With Docker
docker run -d \
  -p 9090:9090 \
  -v $(pwd)/monitoring/config/prometheus.yml:/etc/prometheus/prometheus.yml \
  --name prometheus \
  prom/prometheus

# Or with a local install
./prometheus --config.file=monitoring/config/prometheus.yml
```

Note: when Prometheus runs inside Docker, `localhost:8000` refers to the container itself, so replace the target with an address the container can reach (on Docker Desktop, `host.docker.internal:8000`).
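The article's closing section suggests adding alerting rules. As a sketch (the file name and thresholds below are my own, not from the tutorial), a minimal rules file could look like this, loaded by adding `rule_files: ["alert_rules.yml"]` to `prometheus.yml`:

```yaml
# monitoring/config/alert_rules.yml (hypothetical thresholds)
groups:
  - name: rexuninlu-alerts
    rules:
      - alert: NLULowSuccessRate
        expr: nlu_success_rate < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NLU success rate below 95% for 5 minutes"

      - alert: NLUHighLatency
        expr: histogram_quantile(0.95, rate(nlu_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NLU P95 latency above 1s for 5 minutes"
```

Delivering these alerts (email, chat, etc.) additionally requires configuring an Alertmanager instance, which is outside the scope of this tutorial.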
## 7. Building the Dashboard with Grafana

Install Grafana:

```bash
# With Docker
docker run -d -p 3000:3000 --name grafana grafana/grafana

# Or install locally
# See: https://grafana.com/docs/grafana/latest/installation/
```

Configure the Grafana data source:

1. Open http://localhost:3000 (default credentials: admin/admin).
2. Add a Prometheus data source pointing at http://localhost:9090.
3. Import the NLU monitoring dashboard.

Create `monitoring/config/grafana-dashboard.json` (a simplified sketch of the panels; a full Grafana dashboard export contains more fields):

```json
{
  "title": "RexUniNLU Performance",
  "panels": [
    {
      "title": "Request throughput",
      "type": "graph",
      "targets": [{
        "expr": "rate(nlu_requests_total[1m])",
        "legendFormat": "request rate"
      }]
    },
    {
      "title": "Latency distribution",
      "type": "graph",
      "targets": [{
        "expr": "histogram_quantile(0.95, rate(nlu_request_duration_seconds_bucket[5m]))",
        "legendFormat": "P95 latency"
      }]
    },
    {
      "title": "Success rate",
      "type": "stat",
      "targets": [{
        "expr": "nlu_success_rate * 100",
        "legendFormat": "success rate"
      }]
    }
  ]
}
```

## 8. Running and Testing

Start the instrumented NLU service:

```bash
python server_with_monitoring.py
```

Verify that it works:

```bash
# Health check
curl http://localhost:8000/health

# NLU request
curl -X POST http://localhost:8000/nlu \
  -H "Content-Type: application/json" \
  -d '{"text": "帮我定一张明天去上海的机票", "labels": ["出发地", "目的地", "时间", "订票意图"]}'

# Metrics
curl http://localhost:8000/metrics
```

## 9. Verifying the Monitoring Stack

Open each interface and confirm everything works:

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Log files: `monitoring/logs/nlu_service.log`

You should see:

- Prometheus scraping all of the NLU service's metrics
- Grafana charts for request volume, latency, and success rate
- Detailed request and processing records in the log files

## 10. Summary

This tutorial built a complete monitoring stack for RexUniNLU.

Key takeaways:

- Structured logging with loguru
- Integrating Prometheus metrics into the service
- Visualizing monitoring data with Grafana
- An end-to-end pipeline from log instrumentation to dashboards

Practical value:

- Real-time visibility into service health
- Faster discovery of performance bottlenecks
- Quick localization and troubleshooting when problems occur
- Data-driven capacity planning and service optimization
- Better reliability and user experience

Next steps:

- Add alerting rules that notify you automatically when the success rate drops or latency grows
- Collect more metrics, such as memory usage and GPU utilization
- Build log analytics to assess recognition quality
- Schedule automated tests that regularly verify service health

Your RexUniNLU service has gone from an opaque black box to a fully observable white box, ready to serve in production.

Want more AI images and use cases? Visit the CSDN 星图镜像广场 (StarMap image plaza), which offers a rich set of prebuilt images covering LLM inference, image generation, video generation, and model fine-tuning, with one-click deployment.
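The steps above start Prometheus and Grafana by hand. As an optional convenience (a sketch, not part of the original tutorial), a docker-compose file can bring both up together with the config mounted:

```yaml
# monitoring/docker-compose.yml (hypothetical; adjust paths for your layout)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```

Because Prometheus then runs in a container, the `localhost:8000` target in `prometheus.yml` must be replaced with an address the container can reach.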
