chore: clean up duplicate files created by macOS sync
Details:
- Removed 352 duplicate files with numeric suffixes
- Updated .gitignore to prevent such files in the future
- The files were created by iCloud or other sync-service conflicts
- No impact on project functionality; this only removes redundant files
@@ -1,205 +0,0 @@

# AI Workflow Builder Evaluations

This module provides an evaluation framework for testing the AI Workflow Builder's ability to generate correct n8n workflows from natural language prompts.

## Architecture Overview

The evaluation system is split into two distinct modes:

1. **CLI Evaluation** - Runs predefined test cases locally with progress tracking
2. **Langsmith Evaluation** - Integrates with Langsmith for dataset-based evaluation and experiment tracking

### Directory Structure

```
evaluations/
├── cli/                            # CLI evaluation implementation
│   ├── runner.ts                   # Main CLI evaluation orchestrator
│   └── display.ts                  # Console output and progress tracking
├── langsmith/                      # Langsmith integration
│   ├── evaluator.ts                # Langsmith-compatible evaluator function
│   └── runner.ts                   # Langsmith evaluation orchestrator
├── core/                           # Shared evaluation logic
│   ├── environment.ts              # Test environment setup and configuration
│   └── test-runner.ts              # Core test execution logic
├── types/                          # Type definitions
│   ├── evaluation.ts               # Evaluation result schemas
│   ├── test-result.ts              # Test result interfaces
│   └── langsmith.ts                # Langsmith-specific types and guards
├── chains/                         # LLM evaluation chains
│   ├── test-case-generator.ts      # Dynamic test case generation
│   └── workflow-evaluator.ts       # LLM-based workflow evaluation
├── utils/                          # Utility functions
│   ├── evaluation-calculator.ts    # Metrics calculation
│   ├── evaluation-helpers.ts       # Common helper functions
│   └── evaluation-reporter.ts      # Report generation
└── index.ts                        # Main entry point
```

## Implementation Details

### Core Components

#### 1. Test Runner (`core/test-runner.ts`)

The core test runner handles individual test execution:
- Generates workflows using the WorkflowBuilderAgent
- Validates generated workflows using type guards
- Evaluates workflows against test criteria
- Returns structured test results with error handling
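
As a rough illustration of that flow, the sketch below mirrors the generate, validate, evaluate, report steps; the helper functions and the `TestResult` fields here are illustrative assumptions, not the framework's actual interfaces:

```typescript
// A minimal sketch of the test-runner flow; the helpers and TestResult
// fields are illustrative assumptions, not the framework's interfaces.
interface TestResult {
	testCaseId: string;
	workflow?: unknown;
	score?: number;
	error?: string;
}

// Stand-ins for the agent and evaluator; the real framework wires these
// to the WorkflowBuilderAgent and the LLM-based evaluator chain.
async function generateWorkflow(prompt: string): Promise<unknown> {
	return { nodes: [], connections: {}, prompt };
}

function isValidWorkflow(candidate: unknown): candidate is { nodes: unknown[] } {
	return (
		typeof candidate === 'object' &&
		candidate !== null &&
		Array.isArray((candidate as { nodes?: unknown }).nodes)
	);
}

async function evaluateWorkflow(_workflow: unknown): Promise<number> {
	return 1; // placeholder score
}

async function runSingleTestSketch(testCaseId: string, prompt: string): Promise<TestResult> {
	try {
		const workflow = await generateWorkflow(prompt);
		if (!isValidWorkflow(workflow)) {
			return { testCaseId, error: 'Agent output is not a valid workflow' };
		}
		const score = await evaluateWorkflow(workflow);
		return { testCaseId, workflow, score };
	} catch (error) {
		// Errors are captured in the result rather than thrown, so one
		// failing test cannot abort the whole run.
		return { testCaseId, error: error instanceof Error ? error.message : String(error) };
	}
}
```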

#### 2. Environment Setup (`core/environment.ts`)

Centralizes environment configuration:
- LLM initialization with API key validation
- Langsmith client setup
- Node types loading
- Concurrency and test generation settings
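
A minimal sketch of what such a setup function might return, using only the environment variables documented later in this README (the field names are illustrative assumptions):

```typescript
// Illustrative sketch of the environment configuration implied above;
// the return-type fields are assumptions, not the module's actual shape.
interface TestEnvironmentSketch {
	llmApiKey: string;
	concurrency: number;
	generateTestCases: boolean;
}

function setupTestEnvironmentSketch(): TestEnvironmentSketch {
	const llmApiKey = process.env.N8N_AI_ANTHROPIC_KEY;
	if (!llmApiKey) {
		// Fail fast with a clear message instead of failing mid-run.
		throw new Error('N8N_AI_ANTHROPIC_KEY is required for LLM access');
	}

	return {
		llmApiKey,
		concurrency: Number(process.env.EVALUATION_CONCURRENCY ?? 5),
		generateTestCases: process.env.GENERATE_TEST_CASES === 'true',
	};
}
```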

#### 3. Langsmith Integration

The Langsmith integration provides two key components:

**Evaluator (`langsmith/evaluator.ts`):**
- Converts Langsmith Run objects to evaluation inputs
- Validates all data using type guards before processing
- Safely extracts usage metadata without type coercion
- Returns structured evaluation results

**Runner (`langsmith/runner.ts`):**
- Creates workflow generation functions compatible with Langsmith
- Validates message content before processing
- Extracts usage metrics safely from message metadata
- Handles dataset verification and error reporting
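
For orientation, here is a minimal sketch of a Langsmith-compatible evaluator in the general shape the langsmith JS SDK expects; the type guard and scoring logic are placeholders, not this module's actual behavior:

```typescript
// Hedged sketch of a Langsmith-compatible evaluator; the guard and
// scoring logic are placeholders, not the module's actual behavior.
import type { Run, Example } from 'langsmith/schemas';

interface EvalResultSketch {
	key: string;
	score: number;
	comment?: string;
}

function isWorkflowOutput(value: unknown): value is { workflow: { nodes: unknown[] } } {
	return (
		typeof value === 'object' &&
		value !== null &&
		typeof (value as { workflow?: unknown }).workflow === 'object'
	);
}

async function workflowEvaluatorSketch(run: Run, _example?: Example): Promise<EvalResultSketch> {
	// Validate with a type guard instead of coercing, as described above.
	if (!isWorkflowOutput(run.outputs)) {
		return { key: 'functionality', score: 0, comment: 'Run outputs did not contain a workflow' };
	}
	const nodeCount = run.outputs.workflow.nodes.length;
	return { key: 'functionality', score: nodeCount > 0 ? 1 : 0 };
}
```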

#### 4. CLI Evaluation

The CLI evaluation provides local testing capabilities:

**Runner (`cli/runner.ts`):**
- Orchestrates parallel test execution with concurrency control
- Manages test case generation when enabled
- Generates detailed reports and saves results

**Display (`cli/display.ts`):**
- Progress bar management for real-time feedback
- Console output formatting
- Error display and reporting
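
The concurrency control described for the runner can be pictured as a small worker pool; this is an illustrative pattern, not the runner's actual implementation:

```typescript
// Illustrative worker-pool pattern for bounded parallelism; not the
// runner's actual implementation.
async function runWithConcurrency<T, R>(
	items: T[],
	limit: number,
	task: (item: T) => Promise<R>,
): Promise<R[]> {
	const results: R[] = new Array(items.length);
	let nextIndex = 0;

	// Each worker pulls the next unclaimed item until the queue drains,
	// so at most `limit` tasks are in flight at once.
	async function worker(): Promise<void> {
		while (nextIndex < items.length) {
			const index = nextIndex++;
			results[index] = await task(items[index]);
		}
	}

	await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
	return results;
}
```

With `EVALUATION_CONCURRENCY=10`, for example, the runner would cap the pool at ten tests in flight.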

### Evaluation Metrics

The system evaluates workflows across five categories:

1. **Functionality** (30% weight)
   - Does the workflow achieve the intended goal?
   - Are the right nodes selected?

2. **Connections** (25% weight)
   - Are nodes properly connected?
   - Is data flow logical?

3. **Expressions** (20% weight)
   - Are n8n expressions syntactically correct?
   - Do they reference valid data paths?

4. **Node Configuration** (15% weight)
   - Are node parameters properly set?
   - Are required fields populated?

5. **Structural Similarity** (10% weight, optional)
   - How closely does the structure match a reference workflow?
   - Only evaluated when a reference workflow is provided
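
Combining the category scores into an overall score is plain weighted arithmetic. The sketch below assumes scores normalized to [0, 1] and renormalizes the weights when no reference workflow is provided; the renormalization is an assumption, not documented behavior:

```typescript
// Weighted overall score from the five category weights above. Scores are
// assumed normalized to [0, 1]; renormalizing when structural similarity
// is absent is an assumption, not documented behavior.
interface CategoryScores {
	functionality: number;
	connections: number;
	expressions: number;
	nodeConfiguration: number;
	structuralSimilarity?: number; // only when a reference workflow exists
}

function overallScore(scores: CategoryScores): number {
	let total =
		0.3 * scores.functionality +
		0.25 * scores.connections +
		0.2 * scores.expressions +
		0.15 * scores.nodeConfiguration;
	let weightUsed = 0.9;

	if (scores.structuralSimilarity !== undefined) {
		total += 0.1 * scores.structuralSimilarity;
		weightUsed = 1;
	}

	return total / weightUsed; // renormalize so a perfect workflow scores 1
}
```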

### Violation Severity Levels

Violations are categorized by severity:
- **Critical** (-40 to -50 points): Workflow-breaking issues
- **Major** (-15 to -25 points): Significant problems affecting functionality
- **Minor** (-5 to -15 points): Non-critical issues or inefficiencies
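
As a rough model of how such deductions might combine, assuming a 100-point base score and clamping at zero (both of which are assumptions):

```typescript
// Hedged sketch of applying severity-based deductions to a 100-point
// base; the base score and clamping behavior are assumptions.
type Severity = 'critical' | 'major' | 'minor';

interface Violation {
	severity: Severity;
	pointsDeducted: number; // e.g. 45 for critical, 20 for major, 10 for minor
	description: string;
}

function applyViolations(violations: Violation[]): number {
	const deducted = violations.reduce((sum, v) => sum + v.pointsDeducted, 0);
	return Math.max(0, 100 - deducted); // never drop below zero
}
```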

## Running Evaluations

### CLI Evaluation

```bash
# Run with default settings
pnpm eval

# With additional generated test cases
GENERATE_TEST_CASES=true pnpm eval

# With custom concurrency
EVALUATION_CONCURRENCY=10 pnpm eval
```

### Langsmith Evaluation

```bash
# Set required environment variables
export LANGSMITH_API_KEY=your_api_key
# Optionally specify dataset
export LANGSMITH_DATASET_NAME=your_dataset_name

# Run evaluation
pnpm eval:langsmith
```

## Configuration

### Required Files

#### nodes.json

**IMPORTANT**: The evaluation framework requires a `nodes.json` file in the evaluations root directory (`evaluations/nodes.json`).

This file contains all n8n node type definitions and is used by the AI Workflow Builder agent to:
- Know what nodes are available in n8n
- Understand node parameters and their schemas
- Generate valid workflows with proper node configurations

**Why is this required?**
The AI Workflow Builder agent needs access to node definitions to generate workflows. In a normal n8n runtime, these definitions are loaded automatically. However, since the evaluation framework instantiates the agent without a running n8n instance, the node definitions must be provided manually via `nodes.json`.

**How to generate nodes.json:**
1. Run your n8n instance
2. Download the node definitions from the locally running instance (http://localhost:5678/types/nodes.json)
3. Save them to `evaluations/nodes.json` (a helper sketch follows below)
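
Steps 2 and 3 can be automated with a small script; this is a convenience sketch assuming Node 18+ (for the global `fetch`) and a default local instance, not part of the framework:

```typescript
// Convenience sketch (not part of the framework): fetch node definitions
// from a locally running n8n instance and save them for the evaluations.
// Assumes Node 18+ for the global fetch API.
import { writeFileSync } from 'fs';
import { join } from 'path';

async function downloadNodeTypes(baseUrl = 'http://localhost:5678'): Promise<void> {
	const response = await fetch(`${baseUrl}/types/nodes.json`);
	if (!response.ok) {
		throw new Error(`Failed to fetch node types: ${response.status} ${response.statusText}`);
	}
	writeFileSync(join('evaluations', 'nodes.json'), await response.text(), 'utf-8');
}

downloadNodeTypes().catch(console.error);
```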

The evaluation will fail with a clear error message if `nodes.json` is missing.

### Environment Variables

- `N8N_AI_ANTHROPIC_KEY` - Required for LLM access
- `LANGSMITH_API_KEY` - Required for Langsmith evaluation
- `USE_LANGSMITH_EVAL` - Set to "true" to use Langsmith mode
- `LANGSMITH_DATASET_NAME` - Override default dataset name
- `EVALUATION_CONCURRENCY` - Number of parallel test executions (default: 5)
- `GENERATE_TEST_CASES` - Set to "true" to generate additional test cases
- `LLM_MODEL` - Model identifier for metadata tracking

## Output

### CLI Evaluation Output

- **Console Display**: Real-time progress, test results, and summary statistics
- **Markdown Report**: `results/evaluation-report-[timestamp].md`
- **JSON Results**: `results/evaluation-results-[timestamp].json`

### Langsmith Evaluation Output

- Results are stored in the Langsmith dashboard
- Experiment name format: `workflow-builder-evaluation-[date]`
- Includes detailed metrics for each evaluation category

## Adding New Test Cases

Test cases are defined in `chains/test-case-generator.ts`. Each test case requires:
- `id`: Unique identifier
- `name`: Descriptive name
- `prompt`: Natural language description of the workflow to generate
- `referenceWorkflow` (optional): Expected workflow structure for comparison (an example is sketched below)
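
For illustration, a test case with the fields above might look like the following; the id, prompt, and workflow content are invented for the example:

```typescript
// Illustrative test case using the required fields listed above; the id,
// prompt, and workflow content are invented for this example.
const exampleTestCase = {
	id: 'http-to-slack',
	name: 'Fetch data and post to Slack',
	prompt: 'Every morning at 9am, fetch the latest orders from our API and post a summary to Slack',
	// referenceWorkflow is optional; when present it enables the
	// structural-similarity metric described earlier.
	referenceWorkflow: {
		nodes: [
			{ name: 'Schedule Trigger', type: 'n8n-nodes-base.scheduleTrigger' },
			{ name: 'HTTP Request', type: 'n8n-nodes-base.httpRequest' },
			{ name: 'Slack', type: 'n8n-nodes-base.slack' },
		],
		connections: {},
	},
};
```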

## Extending the Framework

To add new evaluation metrics:
1. Update the `EvaluationResult` schema in `types/evaluation.ts` (sketched below)
2. Modify the evaluation logic in `chains/workflow-evaluator.ts`
3. Update the evaluator in `langsmith/evaluator.ts` to include new metrics
4. Adjust weight calculations in `utils/evaluation-calculator.ts`
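
As a hypothetical sketch of step 1, assuming the schema is defined with zod (whether `types/evaluation.ts` actually uses zod, and its existing field names, are assumptions):

```typescript
// Hypothetical step-1 sketch: extending the result schema with a new
// "errorHandling" category. That types/evaluation.ts uses zod, and the
// existing field names, are assumptions for illustration only.
import { z } from 'zod';

const categoryScoreSchema = z.object({
	score: z.number().min(0).max(1),
	violations: z.array(z.string()),
});

const evaluationResultSketch = z.object({
	functionality: categoryScoreSchema,
	connections: categoryScoreSchema,
	expressions: categoryScoreSchema,
	nodeConfiguration: categoryScoreSchema,
	// New metric added alongside the existing categories; remember to give
	// it a weight in utils/evaluation-calculator.ts (steps 2-4).
	errorHandling: categoryScoreSchema,
});

type EvaluationResultSketch = z.infer<typeof evaluationResultSketch>;
```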
@@ -1,27 +0,0 @@
import { runCliEvaluation } from './cli/runner.js';
import { runLangsmithEvaluation } from './langsmith/runner.js';

// Re-export for external use if needed
export { runCliEvaluation } from './cli/runner.js';
export { runLangsmithEvaluation } from './langsmith/runner.js';
export { runSingleTest } from './core/test-runner.js';
export { setupTestEnvironment, createAgent } from './core/environment.js';

/**
 * Main entry point for evaluation
 * Determines which evaluation mode to run based on environment variables
 */
async function main(): Promise<void> {
	const useLangsmith = process.env.USE_LANGSMITH_EVAL === 'true';

	if (useLangsmith) {
		await runLangsmithEvaluation();
	} else {
		await runCliEvaluation();
	}
}

// Run if called directly
if (require.main === module) {
	main().catch(console.error);
}
@@ -1,106 +0,0 @@
import { readFileSync, existsSync } from 'fs';
import { jsonParse, type INodeTypeDescription } from 'n8n-workflow';
import { join } from 'path';

interface NodeWithVersion extends INodeTypeDescription {
	version: number | number[];
	defaultVersion?: number;
}

export function loadNodesFromFile(): INodeTypeDescription[] {
	console.log('Loading nodes from nodes.json...');

	const nodesPath = join(__dirname, 'nodes.json');

	// Check if nodes.json exists
	if (!existsSync(nodesPath)) {
		const errorMessage = `
ERROR: nodes.json file not found at ${nodesPath}

The nodes.json file is required for evaluations to work properly.
Please ensure nodes.json is present in the evaluations root directory.

To generate nodes.json:
1. Run the n8n instance
2. Export the node definitions to evaluations/nodes.json
3. This file contains all available n8n node type definitions needed for validation

Without nodes.json, the evaluator cannot validate node types and parameters.
`;
		console.error(errorMessage);
		throw new Error('nodes.json file not found. See console output for details.');
	}

	const nodesData = readFileSync(nodesPath, 'utf-8');
	const allNodes = jsonParse<NodeWithVersion[]>(nodesData);

	console.log(`Total nodes loaded: ${allNodes.length}`);

	// Group nodes by name
	const nodesByName = new Map<string, NodeWithVersion[]>();

	for (const node of allNodes) {
		const existing = nodesByName.get(node.name) ?? [];
		existing.push(node);
		nodesByName.set(node.name, existing);
	}

	console.log(`Unique node types: ${nodesByName.size}`);

	// Extract the latest version for each node
	const latestNodes: INodeTypeDescription[] = [];
	let multiVersionCount = 0;

	for (const [_nodeName, versions] of nodesByName.entries()) {
		if (versions.length > 1) {
			multiVersionCount++;
			// Find the node with the default version
			let selectedNode: NodeWithVersion | undefined;

			for (const node of versions) {
				// Select the node that matches the default version
				if (node.defaultVersion !== undefined) {
					if (Array.isArray(node.version)) {
						// For array versions, check if it includes the default version
						if (node.version.includes(node.defaultVersion)) {
							selectedNode = node;
						}
					} else if (node.version === node.defaultVersion) {
						selectedNode = node;
					}
				}
			}

			// If we found a matching node, use it; otherwise use the first one
			if (selectedNode) {
				latestNodes.push(selectedNode);
			} else {
				latestNodes.push(versions[0]);
			}
		} else {
			// Single version node
			latestNodes.push(versions[0]);
		}
	}

	console.log(`\nNodes with multiple versions: ${multiVersionCount}`);
	console.log(`Final node count: ${latestNodes.length}`);

	// Filter out hidden nodes
	const visibleNodes = latestNodes.filter((node) => !node.hidden);
	console.log(`Visible nodes (after filtering hidden): ${visibleNodes.length}\n`);

	return visibleNodes;
}

// Helper function to get a specific node version for testing
export function getNodeVersion(nodes: INodeTypeDescription[], nodeName: string): string {
	const node = nodes.find((n) => n.name === nodeName);
	if (!node) return 'not found';

	const version = (node as NodeWithVersion).version;
	if (Array.isArray(version)) {
		return `[${version.join(', ')}]`;
	}
	return version?.toString() || 'unknown';
}