Core Concepts
Data Hub
Seamlessly integrate and unify your enterprise data.
- Centralize data management for enhanced operational efficiency.
- Effortlessly upload, register, and organize documents at scale.
- Unlock insights with robust business intelligence reporting tools.
Connection Manager
A Connection refers to an automated data pipeline that replicates data from a source to a destination. It establishes a link between a configured source (using a source connector) and a configured destination (using a destination connector) to enable data synchronization.
Sources
A source refers to any system from which data is ingested, such as an API, file, database, or data warehouse.
- Always test your connection before proceeding
- Ensure proper credentials are provided
- Verify data access permissions
Overview
A source refers to any system from which data is ingested, such as an API, file, database, or data warehouse. Setting up a source involves configuring the necessary variables that allow the connector to access and retrieve data. The specific configuration fields may vary depending on the type of connector but typically include credentials for authentication (e.g., username and password, API key), as well as parameters that determine which data to extract. For example, this may involve specifying a start date for syncing records or defining a search query to match the records.
Source Definition
The definition of a source depends on the type of system—whether it's an API, database, file, or data warehouse—and the parameters required for secure connection or authentication. The configuration fields are determined by the connector type and the security protocols necessary to access the data.
Test Connection
After providing the source configuration details, the Test Connection function is used to validate whether the authentication information is correct. It also verifies that the system can successfully connect to the source using the supplied credentials.
Library
Below are examples of top sources that the Vue.ai platform supports for integration:
- Google Sheets
- PostgreSQL
- Amazon Redshift
- HubSpot
- Shopify
CDK
If the desired source is not available in the list of supported connectors, users can utilize the Connector Development Kit (CDK) to build a custom connector. This kit enables the creation of a custom Python-based source to integrate with the system.
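The exact interface of the CDK is defined by its own documentation; as a rough illustration only, a custom Python source typically needs to (a) validate credentials and (b) read records. The class name, method names, and endpoint below are hypothetical placeholders, not the documented CDK API:

```python
from typing import Any, Dict, Iterable

import requests


class CustomSource:
    """Illustrative skeleton of a Python source connector (names are hypothetical)."""

    def __init__(self, config: Dict[str, Any]):
        self.api_key = config["api_key"]            # credential from the source configuration
        self.start_date = config.get("start_date")  # optional parameter controlling which data to extract

    def check_connection(self) -> bool:
        """Validate the credentials, mirroring the Test Connection step."""
        resp = requests.get(
            "https://api.example.com/ping",
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=10,
        )
        return resp.status_code == 200

    def read_records(self) -> Iterable[Dict[str, Any]]:
        """Yield records from the source, optionally filtered by start_date."""
        params = {"since": self.start_date} if self.start_date else {}
        resp = requests.get(
            "https://api.example.com/records",
            headers={"Authorization": f"Bearer {self.api_key}"},
            params=params,
            timeout=30,
        )
        resp.raise_for_status()
        yield from resp.json()["records"]
```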
Destinations
Overview
A destination refers to the target system where ingested data is loaded, such as a data warehouse, data lake, database, analytics tool, or the Vue.ai platform.
Destination Definition
The definition of a destination depends on the type of system—whether it's a data warehouse, data lake, database, analytics tool, or the Vue.ai platform. It involves configuring the necessary parameters to establish a connection for loading ingested data into the target system. The specific configuration fields vary based on the connector type and the security protocols required to ensure secure data transfer.
Vue Dataset
The Vue Dataset acts as a destination for loading data into the Vue.ai platform's data catalog. It is used for any data that needs to undergo workflow processes or report generation within the platform. By using Vue Dataset as a destination, users can fully leverage the data analysis and management features of the Vue.ai platform.
Test Connection
Once the destination configuration details are provided, the Test Connection function is used to verify the accuracy of the authentication details and confirm that the system can successfully connect to the destination. This step ensures that data can be seamlessly loaded into the target system without any issues.
Library
Below are examples of top destinations that the Vue.ai platform supports for integration:
- PostgreSQL
- Amazon Redshift
- Vue Dataset
CDK
If the desired destination is not listed among the supported connectors, users can leverage the Connector Development Kit (CDK) to create a custom connector. This enables the development of a tailored destination, allowing seamless integration with the chosen system.
Connections
Overview
A Connection refers to an automated data pipeline that replicates data from a source to a destination. It establishes a link between a configured source (using a source connector) and a configured destination (using a destination connector) to enable data synchronization. The connection defines essential parameters, including the frequency of replication (e.g., hourly, daily, or manual) and the specific data streams to be replicated.
Stream
A stream represents a collection of related records. Depending on the destination, it may be referred to as a table, file, or blob. The term "stream" is used to generalize the flow of data across different destinations.
Examples of Streams:
- A table in a relational database
- A resource or API endpoint in a REST API
- Records from a directory containing multiple files in a filesystem
Record
A record is an individual entry or unit of data, often referred to as a "row." Each record is typically unique and encapsulates information related to a specific entity, such as a customer or transaction.
Examples of Records:
- A row in a relational database table
- A line within a data file
- A data unit retrieved from an API response
Batch
A batch refers to a group of records processed and transferred together as a single unit. Batching is commonly used to efficiently transfer large volumes of data instead of processing records individually.
Examples of Batches:
- A collection of rows in a relational database that are updated simultaneously
- A set of files transferred together during a data migration
- Multiple data entries sent in a single API request
Cursor Field
A cursor field is a specific attribute of a record within a stream. When the source sync mode is configured for incremental updates, the cursor field is used to track which records have already been replicated, so that only records added or updated since the last sync are retrieved (for example, an updated-at timestamp or an incrementing ID).
Examples of Fields:
- A column within a relational database table
- An attribute within an API response
Sync Schedule
There are three methods available for scheduling synchronization:
- Scheduled: This option allows you to set sync intervals, such as every 24 hours or every 2 hours.
- CRON Schedule: For more advanced scheduling, you can use a CRON expression to define specific timing for sync operations.
- Manual: You can initiate a sync manually by clicking the "Sync Now" button in the user interface or through the API.
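As an illustration of the scheduling options, the snippet below shows a standard five-field CRON expression and a hedged sketch of triggering a manual sync over HTTP. The endpoint, payload, and token shown are placeholders, not the platform's documented API:

```python
import requests

# A standard five-field cron expression: minute hour day-of-month month day-of-week.
# "0 2 * * *" would run the sync every day at 02:00.
DAILY_AT_2AM = "0 2 * * *"

# Hypothetical manual-sync call; replace the URL, connection ID, and token with real values.
response = requests.post(
    "https://platform.example.com/api/connections/<connection_id>/sync",
    headers={"Authorization": "Bearer <api_token>"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g., the job ID and status of the triggered sync
```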
Change Data Capture
Change Data Capture (CDC) is the process of capturing and tracking changes made to a database, including inserts, updates, and deletes. CDC ensures that the target system stays synchronized with the source database in real-time, which is essential for data warehousing, business intelligence, and replication scenarios.
Sync Modes
Sync modes define how data is retrieved from a source and transferred to a destination. Vue.ai offers several sync modes to meet different objectives, each impacting how synchronization occurs and whether duplicate records will be generated in the destination.
Sync modes are defined by two components: Source Sync Mode and Destination Sync Mode.
Source Sync Mode
This part describes how data is read from the source:
Mode | Description |
---|---|
Incremental | Reads only the records added since the last sync. The first sync acts as a Full Refresh. |
Incremental (Cursor-based) | Uses a cursor field to track the last processed record, allowing only new records to be retrieved on subsequent syncs. |
Incremental (CDC) | Captures changes in real time; supported by some sources. For more details, refer to the CDC documentation. |
Full Refresh | Reads all records from the source, regardless of previous syncs. |
Destination Sync Mode
This part specifies how data is written to the destination:
Mode | Description |
---|---|
Overwrite | Replaces existing data in the destination with new data. |
Append | Adds new data to existing tables without altering any pre-existing records. |
Append Deduped | Appends data to existing tables while keeping a history of changes. The final table is de-duplicated using a primary key. |
Overwrite Deduped | Replaces existing data and removes duplicates from the final dataset, ensuring uniqueness with a primary key. |
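To make the destination modes concrete, here is a minimal pandas sketch (pandas being one of the platform's built-in engines) contrasting Append with Append Deduped on a primary key; it is a conceptual illustration rather than the platform's actual implementation:

```python
import pandas as pd

# Existing data already in the destination table.
existing = pd.DataFrame(
    [{"id": 1, "status": "pending"}, {"id": 2, "status": "shipped"}]
)

# Newly synced records; id 2 has been updated at the source.
incoming = pd.DataFrame(
    [{"id": 2, "status": "delivered"}, {"id": 3, "status": "pending"}]
)

# Append: keep everything, including the older version of id 2.
appended = pd.concat([existing, incoming], ignore_index=True)

# Append Deduped: keep history, then de-duplicate on the primary key,
# retaining the most recent version of each record.
append_deduped = (
    pd.concat([existing, incoming], ignore_index=True)
    .drop_duplicates(subset="id", keep="last")
)

print(append_deduped)
#    id     status
# 0   1    pending
# 2   2  delivered
# 3   3    pending
```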
Destination Stream Name
Available exclusively for the Vue Dataset destination, this feature represents the dataset where you intend to load your data for integration into the system. Users can either use an existing dataset with a matching schema or create a new dataset to load the data.
Validation Reports
After each successful sync run, a validation report is generated. It includes key metrics and definitions:
Key | Description |
---|---|
attempt | Indicates the number of the current sync attempt. |
bytesSynced | The total number of bytes that were successfully synced during the attempt. |
recordsSynced | The total number of records that were successfully synced. |
totalStats | An object containing aggregate statistics for the sync attempt. |
recordsEmitted | The total number of records emitted (processed) during the sync. |
bytesEmitted | The total number of bytes emitted during the sync. |
stateMessagesEmitted | The number of state messages emitted during the sync process. |
recordsCommitted | The total number of records that were successfully committed to the destination. |
streamStats | An array that provides detailed statistics for individual data streams that were processed during the sync. |
Each object within the streamStats array includes:
Key | Description |
---|---|
streamName | The name of the data stream being reported on. |
stats | An object containing statistics specific to the stream. |
recordsEmitted | The number of records emitted for this specific stream. |
bytesEmitted | The total bytes emitted for this stream. |
recordsCommitted | The number of records that were successfully committed for this stream. |
failureSummary | Provides information about any failures that occurred during the sync attempt. A value of null indicates that there were no failures in this attempt. |
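For orientation, a validation report shaped by the keys above might look like the following Python dictionary; the exact payload produced after a sync may differ in detail:

```python
# Illustrative shape of a validation report using the keys defined above;
# the actual report produced by the platform may differ in detail.
validation_report = {
    "attempt": 1,
    "bytesSynced": 10_485_760,
    "recordsSynced": 25_000,
    "totalStats": {
        "recordsEmitted": 25_000,
        "bytesEmitted": 10_485_760,
        "stateMessagesEmitted": 5,
        "recordsCommitted": 25_000,
    },
    "streamStats": [
        {
            "streamName": "orders",
            "stats": {
                "recordsEmitted": 15_000,
                "bytesEmitted": 6_291_456,
                "recordsCommitted": 15_000,
            },
        },
        {
            "streamName": "customers",
            "stats": {
                "recordsEmitted": 10_000,
                "bytesEmitted": 4_194_304,
                "recordsCommitted": 10_000,
            },
        },
    ],
    "failureSummary": None,  # None (null) means the attempt had no failures
}
```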
Document Manager
The Document Manager provides comprehensive capabilities for intelligent document processing, from defining document types and taxonomies to executing complex extraction workflows.
Document Types
Intelligent Document Processing (IDP) in the Document Manager automates the extraction of specific information from documents. This is achieved by creating and managing reusable templates that teach the AI what information to look for and where to find it. The entire system is built upon two fundamental concepts: the Document Type and its corresponding Taxonomy.
A Document Type is a reusable blueprint or model that defines a specific category of document. Think of it as a template for processing "Invoices," "US Driver's Licenses," or "Bank Statements." By creating a distinct Document Type for each, you tell the system how to handle them individually.
Each Document Type is defined by two key aspects: its structure (the physical layout) and its taxonomy (the data to be extracted).
Document Structure and Handling
This setting informs the AI model about the document's structural consistency and any required pre-processing steps. Choosing the correct option is crucial because it helps the model decide which cues (visual or semantic) to prioritize, leading to higher accuracy.
Layout Categories
These categories describe the inherent structure of the document itself.
Structured
- Definition: These documents have a fixed, unchanging format where data fields appear in the same position on every instance of the document.
- Examples: Government forms (like a W-9), application forms, passports.
- Why It Matters: For structured documents, the AI model heavily relies on spatial cues (the physical location of text). Once it learns that the "Date of Birth" is in a specific spot, it will look there first on all subsequent documents.
Semi-structured
- Definition: These documents contain a similar set of information, but the layout can vary from one instance to another. They have a predictable structure but not a fixed format.
- Examples: Invoices (different vendors have different templates), purchase orders, receipts, and bank statements. The Invoice Number will always be present, but its location can change.
- Why It Matters: The AI model uses a combination of spatial cues and semantic understanding. It knows to look for a field labeled "Invoice #" or a value that looks like an invoice number, regardless of its exact position.
Unstructured
- Definition: These documents have no predefined format or consistent layout. The required information is embedded within free-flowing text.
- Examples: Legal contracts, emails, business reports, and press releases.
- Why It Matters: The model relies almost entirely on semantic relations and context to find the relevant information. It understands language to identify a "contract start date" or "termination clause" based on the surrounding sentences.
Specialized Document Handlers
These are specialized pre-processing workflows that handle common, complex scenarios before the extraction logic is applied.
ID Cards
- What it does: This option is designed for images or pages containing one or more ID cards. The system first runs a detection model to find and crop each individual card.
- Why it matters: It automatically isolates each card, treating it as a separate page for extraction. This is essential when a single scanned image contains multiple IDs, ensuring that data from one card isn't confused with another.
Doc Detect
- What it does: Ideal for multi-document PDFs where different document types are combined into one file (e.g., an application packet containing an application form, a driver's license, and a bank statement). This feature allows you to classify each page or group of pages.
- Why it matters: It enables the system to split a single file into logical sub-documents and then apply the correct Document Type extraction logic to each distinct section, automating complex document separation tasks.
Bank Statement
- This is a pre-configured template optimized for the semi-structured nature of bank statements, providing a head start on building the taxonomy for this common document type.
How Document Types Link to Extraction
Registering a Document Type is the training step. Once a Document Type is finalized and Registered, it becomes an active model. You can then upload new documents and assign them to that type. The system will apply the learned layout rules and taxonomy to automatically extract the specified data fields from the new document.
Taxonomy Overview
The Taxonomy is the heart of a Document Type. It is the complete, structured list of all the data fields (or attributes) that you want to extract from that document.
Attribute Properties
Each attribute in the taxonomy is defined by a set of properties that control the extraction process.
- Name: The unique, user-friendly name for the data field (e.g., Date of Expiry).
- Annotation: The visual link between the attribute and its location on the example document, created by drawing a bounding box.
- Type: The data type of the attribute. Specifying the correct type enables data validation and specific formatting rules. Common types include:
- Alpha Numeric, Barcode, Checkbox, Date, Enum, Free Form Text, Name, Numeric, Signature.
- Table: A special, complex type for extracting structured data from grids or tables.
- Enable Redaction: A security feature for masking personally identifiable information (PII) or other sensitive data.
- Tags: Keywords or labels you can assign to an attribute for organization and filtering.
- Description: A brief, plain-language explanation of what the attribute represents.
- Instruction: Critical context or hints that guide both the AI model and human reviewers. For the model, it acts as a prompt to disambiguate information.
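As a rough illustration, the properties above could be captured in a structure like the one below. The field names and the coordinate format of the annotation are assumptions for the sketch, not the platform's internal schema:

```python
# Illustrative representation of a taxonomy attribute; field names are assumptions.
date_of_expiry = {
    "name": "Date of Expiry",
    "type": "Date",
    "enable_redaction": False,
    "tags": ["identity", "validity"],
    "description": "The date on which the ID card ceases to be valid.",
    "instruction": "Prefer the date labelled 'EXP' or 'Expiry'; ignore the issue date.",
    # The annotation is created visually by drawing a bounding box on the sample
    # document; normalized page coordinates are shown here only for illustration.
    "annotation": {"page": 1, "bbox": [0.62, 0.41, 0.85, 0.46]},
}
```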
The Table Attribute Type
Extracting data from tables is a common requirement and is treated as a special attribute type. A Table attribute is not a single value but a collection of rows and columns, and it comes with its own powerful configuration interface.
Annotating a Table:
The process of annotating a table is more interactive than for simple fields:
- Initial Setup: In the attribute properties pane, you provide an initial estimate for the number of Columns and Rows in the table. You also specify if the First row is header.
- Bounding Box: You draw a single bounding box that encompasses the entire table area on the document.
- Grid Adjustment: The system overlays a grid based on your initial setup. You can then interactively drag the horizontal and vertical grid lines to precisely align them with the cell borders of the table in the document, ensuring perfect cell detection.
Configuring the Table Schema:
After annotating the table's location, you must define its internal structure by clicking Configure Columns. This opens a new view where you define the schema for the data to be extracted. For each column you want to capture, you can configure the following properties:
- Header: The standardized, canonical name you want to use for this column in your final data output (e.g., date_of_birth, item_sku). This ensures a consistent schema regardless of how the header is written in the document.
- Alias: A list of possible header names that might appear in different versions of the document. This is a powerful mapping tool. For example, your standardized Header might be person_name, but the Aliases could include "Name of Person," "Full Name," and "Applicant Name." The system will recognize any of these aliases and map the data to the correct standardized header.
- Strict Matching: A toggle that controls matching behavior.
  - Enabled: The system will only extract data for this column if the header in the document is an exact match to one of the specified aliases.
  - Disabled: The system can use more flexible semantic matching to identify the column, even if the header text doesn't match an alias perfectly.
- Data Type: Just like regular attributes, you can assign a specific data type (Numeric, Date, Text, etc.) to each individual column. This enables validation and formatting at the column level.
- Description & Instruction: Field-level guidance for each specific column, providing context for both the AI model and any human reviewers.
By defining a table schema, you transform messy, variable table data into a clean, structured, and predictable JSON output (typically an array of objects) ready for use in downstream systems.
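For example, a line-items table whose schema standardizes the headers item_sku, description, and amount might yield output like this; the exact output shape is an assumption for illustration:

```python
# Illustrative extraction output for a table schema with standardized headers
# "item_sku", "description", and "amount"; the real output shape may differ.
line_items = [
    {"item_sku": "SKU-1042", "description": "USB-C cable", "amount": 9.99},
    {"item_sku": "SKU-2210", "description": "Wireless mouse", "amount": 24.50},
    {"item_sku": "SKU-3305", "description": "Laptop stand", "amount": 39.00},
]

# Downstream systems can consume the structured rows directly.
total = sum(row["amount"] for row in line_items)
print(total)  # 73.49
```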
Key Processes and Statuses
0-Shot Extraction
When you first upload a sample document, the system performs a 0-shot extraction—an automated pass to identify potential fields. This gives you a pre-populated taxonomy as a starting point to refine.
Document Status: Draft vs. Registered
- Draft: The "in-progress" state for building and refining a Document Type.
- Registered: The final, "live" state. A registered model is ready to be used for processing new documents.
Taxonomy
The taxonomy defines the full set of attributes to be extracted from the document. It includes the following components:
- Attribute Name: Specifies the unique name of each attribute.
- Data Type: Indicates the type of data associated with the attribute (e.g., text, number, date).
- Formatting Configuration: Outlines any specific formatting required for the attribute.
Currently, the taxonomy supports a single-level hierarchy, except in cases involving tables. For multi-page tables, a hierarchical structure applies: the top level displays the merged table, while the underlying individual tables are indented one level below.
Taxonomy Actions
Add New Attribute
To add a new attribute to the taxonomy:
- Click + Add New.
- Enter a name for the attribute.
- Specify the attribute's data type.
- Annotate, if necessary.
- Click Save.
Delete Attribute
To delete an existing attribute:
- Hover over the attribute you wish to delete.
- Open the context menu and select Delete.
- Confirm the deletion action when prompted.
Multi-Select Attributes
For bulk actions on multiple attributes, you can use the multi-select feature. Currently, the primary bulk action supported is deletion.
- To select all attributes, click Select All. Alternatively, use the checkboxes next to each attribute to manually select or deselect items.
- Once your selection is complete, choose Delete to remove the selected attributes.
Document Extraction
After a Document Type is registered, the platform is ready to process documents. The lifecycle of a document involves several key stages, from ingestion and extraction to data storage, post-processing, and performance monitoring.
Document Ingestion Methods
There are two primary ways to send documents to the platform for processing:
UI Upload: Users can manually upload individual files or batches of documents directly through the Documents Hub. This is ideal for ad-hoc tasks, bulk processing of historical files, or workflows where human operators are the starting point.
API Integration: For automated, high-volume workflows, documents can be submitted programmatically via a dedicated REST API endpoint. This allows you to integrate the IDP service directly into your existing applications, such as an email intake system, a mobile app for receipt scanning, or a legacy system's document queue.
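As a hedged sketch of the API route, a document submission might look like the following; the endpoint, parameters, and authentication shown are placeholders, and the platform's REST API documentation defines the actual contract:

```python
import requests

# Hypothetical document-submission call; replace the URL, token, and fields
# with the values defined by the platform's REST API documentation.
with open("invoice_0042.pdf", "rb") as f:
    response = requests.post(
        "https://platform.example.com/api/documents",
        headers={"Authorization": "Bearer <api_token>"},
        files={"file": ("invoice_0042.pdf", f, "application/pdf")},
        data={"document_type": "Invoice", "batch_name": "Q4_Invoices"},
        timeout=60,
    )
response.raise_for_status()
print(response.json())  # e.g., the new Document ID and its initial status
```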
The Extraction Workflow
Once a document is ingested, it triggers a workflow in the Automation Hub. This workflow typically consists of:
- Classification (Optional): If you used Auto Classify, the first step is to identify the correct Document Type.
- Extraction: The system applies the taxonomy of the identified Document Type to extract the relevant data fields.
- Post-processing (Optional): After extraction, the data can be passed to subsequent nodes in the workflow for further refinement.
Extensibility with Code Nodes
The true power of the platform lies in its extensibility. You can add custom Code Nodes (e.g., Python scripts) to the workflow after the IDP step. This enables limitless post-processing possibilities, such as:
- Custom Validation: Implementing complex business rules that go beyond standard data types (e.g., "ensure the delivery date is after the order date").
- Data Enrichment: Calling an external API to enrich the extracted data (e.g., using an address to look up its coordinates).
- Custom Formatting: Transforming the data into a specific format required by a downstream system.
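For instance, the custom-validation bullet above ("ensure the delivery date is after the order date") could be implemented in a Python Code Node roughly as follows; the record keys and input/output convention are assumptions for the sketch:

```python
from datetime import date


def validate_order(record: dict) -> dict:
    """Example business rule for a post-processing code node:
    the delivery date must not precede the order date.
    The record keys used here are assumptions for the sketch."""
    order_date = date.fromisoformat(record["order_date"])
    delivery_date = date.fromisoformat(record["delivery_date"])

    record["validation_errors"] = []
    if delivery_date < order_date:
        record["validation_errors"].append("delivery_date is earlier than order_date")
    return record


# Example input, as it might arrive from the IDP extraction step.
extracted = {"order_date": "2024-03-01", "delivery_date": "2024-02-27", "total": 129.90}
print(validate_order(extracted)["validation_errors"])
# ['delivery_date is earlier than order_date']
```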
Human-in-the-Loop: Review and Annotation
After the automated workflow completes, documents are typically set to a Pending status for human review. This "human-in-the-loop" step is crucial for:
- Quality Assurance: Correcting any errors made by the AI.
- Handling Low-Confidence Extractions: Focusing operator attention on fields where the model was uncertain.
- Continuous Learning: Every correction made during annotation serves as feedback that helps retrain and improve the AI models over time.
Data Storage and Organization
Extracted data is not just a temporary result; it is stored in a structured and accessible way within the platform's Data Hub, populating two distinct datasets:
Documents Dataset (Transactional)
- Structure: Each row represents a single uploaded document.
- Columns: Includes metadata like Document ID, Document Name, Status (Pending/Reviewed), Assigned User, Batch Name, and Tags.
- Purpose: Optimized for operational and transactional queries, such as "Find all pending documents assigned to User A" or "Show me all documents from the 'Q4_Invoices' batch."
Document Extraction Dataset (Analytical)
- Structure: Each row represents a single attribute extracted from a document (a "long" format).
- Columns: Includes Document ID, Attribute Name, Extracted Value, Confidence Score, Data Type, etc.
- Purpose: Optimized for analytical queries across many documents. For example, "What is the average 'Total Amount' for all invoices this month?" or "How many driver's licenses expire in the next 90 days?"
This dual-dataset approach provides flexibility, allowing other platform features like Datasets and Workflows to easily query, join, and act upon the extracted information.
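As an illustration of an analytical query against the long-format Document Extraction Dataset, the pandas sketch below answers the "average Total Amount" question; it assumes the dataset has already been loaded into a DataFrame with the column names listed above:

```python
import pandas as pd

# A few rows of a long-format extraction dataset (one row per extracted attribute).
extraction = pd.DataFrame(
    [
        {"Document ID": "doc-1", "Attribute Name": "Total Amount", "Extracted Value": "120.00"},
        {"Document ID": "doc-2", "Attribute Name": "Total Amount", "Extracted Value": "87.50"},
        {"Document ID": "doc-2", "Attribute Name": "Invoice Number", "Extracted Value": "INV-881"},
    ]
)

# "What is the average 'Total Amount' for these invoices?"
totals = extraction.loc[extraction["Attribute Name"] == "Total Amount", "Extracted Value"]
print(totals.astype(float).mean())  # 103.75
```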
Exporting Data and Documents
You can export data directly from the Documents Hub. The export functionality is flexible, allowing you to download:
- Extracted Data: In structured formats like JSON (for hierarchical data) and CSV (for tabular analysis).
- The Documents Themselves: You can download the original files or versions with redaction applied, ensuring that sensitive information is masked before the document leaves the platform.
Performance Monitoring and Metrics
The lifecycle is complete once you monitor the performance of your Document Types. From the Document Type listing page, you can generate and view metrics that provide insight into:
- Usage Metrics: Volume of documents processed, number of pages, processing times, etc.
- Model Performance: Key metrics like field-level accuracy, straight-through processing (STP) rates (i.e., documents requiring no human correction), and the distribution of confidence scores.
This data is essential for understanding your ROI, identifying bottlenecks, and deciding which Document Types may need further refinement.
Dataset Manager
The Dataset Manager provides comprehensive capabilities for organizing, managing, and analyzing structured data within the Vue.ai platform.
Data Catalog
What is a Data Catalog?
A data catalog is a centralized repository that stores metadata about data assets within a platform or organization. It enables centralized access control, auditing, lineage, and data discovery, serving as a comprehensive inventory of all available data sources. The catalog provides detailed information about each dataset, including its structure, location, ownership, usage, and other relevant attributes.
By leveraging a data catalog, users can efficiently discover, understand, access, and manage data, ensuring streamlined operations and enhanced collaboration.
Key Features of a Data Catalog
Centralized Metadata Management
- Stores and manages metadata, such as data definitions, descriptions, schemas, and lineage.
- Fosters collaboration across roles, including data consumers, analysts, data scientists, machine learning engineers, marketing, sales teams, and more.
Data Discovery
- Facilitates the search and exploration of datasets using metadata.
- Enables users to locate relevant datasets from a centralized repository by applying various filters and criteria.
Data Lineage
- Tracks the origin, source, and transformations of data throughout its lifecycle.
- Provides insights into how data flows and is manipulated across different systems and processes.
Data Auditing and Governance
- Captures user-level audit logs that record access to datasets and dataset groups.
- Enforces policies and standards for data usage, security, and compliance within the platform.
Permissions and Collaboration
- Grants users secure access to raw data stored in the cloud via credentials.
- Enables annotation, commenting, and sharing of insights about datasets and dataset groups, enriching metadata and fostering collaboration.
Data Quality Management
- Assesses and monitors the quality of data.
- Measures key metrics such as completeness and consistency at both the dataset and dataset group levels.
Datasets
What is a Dataset?
A dataset is a collection of data, organized and structured according to a predefined schema. This schema enforces a consistent structure, ensuring that the data can be effectively utilized for various purposes.
Datasets serve as cohesive units of information related to a specific purpose and are fundamental to a data catalog.
Common Use Cases of Datasets:
- Transactional purposes: Managing and recording business transactions.
- Processing workloads: Supporting ETL (Extract, Transform, Load) operations or machine learning and data science workflows.
- Analytics and insights: Driving business decisions by analyzing data to solve specific problems.
What is the Difference Between Data and Metadata?
Data: Represents the actual content or raw information stored, processed, or analyzed. It consists of facts, observations, or values in various forms such as text, numbers, images, or videos.
- Example: In a dataset of customer records, data includes attributes like customer name, age, address, and purchase history.
Metadata: Provides context and descriptive information about the data, facilitating its understanding, management, and usage. Metadata does not contain actual data content but describes it.
Types of Metadata:
- Descriptive: Summarizes the data, including titles, tags, and keywords.
- Structural: Defines how data is organized, such as schemas, tables, fields, or columns.
- Administrative: Details ownership, creation/modification history, and access controls.
- Technical: Specifies technical aspects like file format, size, encoding, data source, quality, or lineage.
What is a Row (or Record) in a Dataset?
A row or record in a dataset can contain:
- Unstructured text: Free-form strings.
- Boolean values: True/false.
- Numerical values: Integers or floating-point numbers.
- Categorical variables: Predefined categories represented as strings or Booleans.
- Structured text: JSON objects, arrays, or vectors.
- Cloud object references: Links to files like images, documents, or audio stored in the cloud.
The choice of data type depends on the data's nature, intended use, and storage requirements.
Entities that Define a Dataset
- Dataset Description: A brief overview of the dataset and its content. This can be manually entered or auto-generated using the magic wand feature (🪄).
- Dataset Size and Number of Records: Details the number of rows and the storage size of the dataset.
- Tags: Labels that help in organizing datasets logically, enabling easier search and identification.
- Dataset Schema: An overview of the dataset's composition, including columns and their data types. Column descriptions can also serve as a data dictionary for better understanding.
What is Dataset Profiling?
Dataset profiling examines underlying data and schema details to uncover distribution patterns, outliers, and dependencies. This aids in effective data management.
Schema Details in Profiling:
- Validations
- Memory (MiB)
- Distinct values: Count and percentage of unique values.
- Negative values: Count and percentage of negative values.
- Missing values: Count and percentage of missing values.
- Mean: Average value of the column.
- Min/Max: Minimum and maximum column values.
- Correlation: Relationship with other columns.
- Count: Total column values.
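These statistics map closely to standard DataFrame operations. The sketch below shows how such a profile could be computed with pandas; it is a conceptual illustration, not the platform's profiling implementation:

```python
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 25.5, None, -4.0, 25.5], "visits": [3, 1, 4, 2, 3]})

profile = {
    "count": df["amount"].count(),                  # non-null values in the column
    "distinct_values": df["amount"].nunique(),
    "negative_values": (df["amount"] < 0).sum(),
    "missing_values": df["amount"].isna().sum(),
    "mean": df["amount"].mean(),
    "min": df["amount"].min(),
    "max": df["amount"].max(),
    "memory_mib": df["amount"].memory_usage(deep=True) / (1024 ** 2),
    "correlation_with_visits": df["amount"].corr(df["visits"]),
}
print(profile)
```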
What is Dataset Sampling and Why is it Done?
Sampling involves selecting a representative subset of data from a larger dataset for analysis. Common sampling methods include random selection or choosing top/bottom N records. The sample size is critical to ensure accurate representation.
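The common sampling methods correspond to simple DataFrame operations, as in this pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 1001), "spend": range(1000, 0, -1)})

random_sample = df.sample(n=100, random_state=42)  # random selection
top_sample = df.head(100)                          # top-N records
bottom_sample = df.tail(100)                       # bottom-N records
```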
Metrics
Metrics and reports enable slicing, dicing, analytics, and summarization of datasets. Users can create aggregations or metrics based on dataset columns and visualize them using charts.
Options for Metrics Creation:
- Utilize the platform's built-in intelligence to auto-generate reports.
- Manually create customized reports to meet specific requirements.
Dataset Groups
Dataset groups are organizational units that logically group related datasets to support isolation, management, analysis, and governance. They enable effective organization and management within a data ecosystem, similar to folders in a file system or schemas in a database, facilitating hierarchical organization of datasets.
Dataset groups can isolate datasets relevant to specific use cases, projects, or security boundaries, creating a structure that supports efficient data handling and governance.
Data Relationships
A relationship is a connection or association between two or more datasets, illustrating how they are linked within a dataset group. These relationships are essential for understanding interconnections among datasets, enabling more comprehensive metadata management, analysis, and exploration of the ecosystem.
By defining relationships within a dataset group, users gain insights into data connections, enabling downstream tasks—such as data processing, management, and analysis—that drive business outcomes.
Cardinality
Cardinality defines the number of instances of one dataset that relate to instances of another through a specific relationship. There are three main types of cardinality:
One-to-One (1:1) Cardinality
- Each instance of one entity is associated with exactly one instance of another entity.
One-to-Many (1:N) Cardinality
- Each instance of one entity can relate to multiple instances of another, while each instance of the second entity relates to only one instance of the first.
- The reverse, a many-to-one relationship, can also exist.
Many-to-Many (N:M or N:N) Cardinality
- Multiple instances of one entity relate to multiple instances of another.
- This often requires an intermediary (junction) table to represent the relationship.
Entity-Relationship (ER) Diagrams
Entity-Relationship (ER) diagrams are graphical representations used to model database structures, depicting:
- Entities: Represented as datasets
- Attributes: Characteristics or schema details of the datasets
- Relationships: Connections between datasets within a database, including defined cardinality
ER diagrams serve as a blueprint for understanding dataset structures and relationships, helping stakeholders make informed decisions and align with data requirements.
Purpose of Dataset Groups
Dataset groups serve several purposes:
Logical Organization
- Provides a structure for organizing datasets based on relationships and domains
- Allows efficient navigation and discovery of datasets, particularly in large data ecosystems
Data & Metadata Management
- Facilitates data management by grouping datasets based on shared characteristics
- Enhances metadata for datasets with similar properties
Relationship Management
- Enables understanding of relationships within a dataset group
- Serves as documentation, aiding schema and relationship comprehension for workflows
Permissions
- Simplifies access control and permissions management for datasets within the group
Data Analysis and Exploration
- Allows users to explore patterns and correlations within the group
- Helps stakeholders visualize and solve problems effectively
Data Governance
- Enforces data governance policies, including quality checks, privacy rules, and compliance standards
Collaboration and Sharing
- Creates a shared environment for accessing and working with datasets
Constraints and Behavior
Dataset and Dataset Group Constraints
- A dataset can belong to only one dataset group
- Two datasets within the same group cannot share the same name
Adding Datasets to Groups
With the right permissions, users can add datasets to groups in the following ways:
Defining Relationships
- Users can view all datasets outside the group to select those to add
Direct Addition
- Users can add datasets directly from the dataset group's information page
Creating a Dataset
- New datasets can be added to an existing group (without a predefined relationship)
- Assigned to a new group
- Left ungrouped
Automation Hub
Design advanced analytics and machine learning workflows tailored to your needs.
- Create custom nodes and automate processes for specific problem statements.
- Streamline the design and execution of workflows with advanced automation capabilities, enabling scalable and efficient data and computational processes.
Workflow Concepts
Overview
A workflow is a structured series of tasks executed in a specific order to achieve a particular objective, represented as a Directed Acyclic Graph (DAG). Workflows automate and manage complex processes, offering flexible control over task execution, condition-based decision-making, and support for both automated and human-in-the-loop operations. They can be triggered in various ways, handle both one-time and recurring tasks, and provide real-time monitoring for tracking progress and outcomes.
Vue.ai's Enterprise AI Orchestration Platform supports various types of workflows, including analytical workflows, data science/machine learning (DS/ML) workflows, and business process workflows, each tailored to meet diverse operational needs.
Key Features
No-Code Workflow Building
Empowers users with a no-code, drag-and-drop interface, allowing for effortless creation of analytical, data science, and business process workflows on the Enterprise AI Orchestration Platform.
Polyglot Compute Orchestration
Unlocks the power of polyglot compute with Vue.ai’s platform, enabling users to seamlessly integrate nodes across multiple programming languages and runtimes within a single workflow. This multi-environment orchestration provides flexibility and performance, allowing users to leverage the best tools for each task while maintaining an efficient workflow experience.
Rapid Prototyping
Accelerate development cycles with Vue.ai’s rapid prototyping feature, running workflows in speed-run mode on data samples for quick verification and iteration, enhancing productivity.
Workflow Builder
The Workflow Builder is a low-code, drag-and-drop tool designed to simplify the creation of workflows. It allows users to configure and build complex workflows with ease, offering both form-driven and JSON-driven editors to sequence nodes. Users can search through a library of nodes, including marketplace and custom options, and create custom nodes with their preferred engines, providing flexibility in customization.
Engines
Engines bring polyglot compute power to workflows, allowing nodes to use different programming languages and environments based on task-specific requirements. This versatility enables workflows to incorporate a variety of engines, including Python-based tasks with Pandas, large-scale analytics with Spark, or custom applications in other languages. Supporting multiple languages and runtimes enhances the capability and efficiency of workflows for handling diverse processing needs.
Built-in Presets/Library
Pandas
A built-in engine for data manipulation and analysis, ideal for tasks such as data cleaning, transformation, and exploration on small to medium-sized datasets.
Spark
A distributed data processing engine optimized for large-scale analytics and machine learning tasks, enabling parallel processing across clusters for efficient big data handling.
Node Categories
In the platform UI, a node is the building block of a workflow: it represents an operation, a transformation, or a dataset. Nodes can be interconnected to form a pipeline or workflow that processes data step by step.
Dataset Nodes
Nodes that represent core data entities within a workflow. These serve as input/output datasets that feed data into the pipeline or capture results, linking data sources to the operations performed.
Compute Nodes
Nodes that perform specific computational tasks, such as Optical Character Recognition (OCR), data extraction (e.g., Textract), or machine learning inference. These nodes handle data-intensive processing using underlying engines for efficient task execution.
Connector Nodes
Nodes that enable seamless data integration with external systems, APIs, or data sources. They handle both inbound and outbound data flows, facilitating data movement between the workflow and platforms such as databases, data lakes, or external APIs.
HITL (Human-In-The-Loop) Nodes
Incorporate human decision points within automated workflows, pausing execution for human review or approval before continuing, ideal for tasks requiring manual oversight.
Model Nodes
Nodes that handle model training, deployment, and versioning within a workflow, encapsulating model development and producing artifacts for reuse or deployment, crucial for AI/ML workflows.
Transform Nodes
Apply data transformations or wrangling operations to datasets, such as data cleansing, feature engineering, or reshaping, enabling workflows to process and refine data before passing it to subsequent nodes.
Workflow Nodes
Represent entire workflows within a larger workflow, allowing modular design and reuse. These nodes encapsulate pre-built workflows, enabling complex processes to be nested and reused, enhancing efficiency.
HTTP Nodes
Enable interaction with external services through HTTP requests. These nodes allow workflows to trigger API calls, fetch external data, or send results to third-party services, integrating external web services within workflows.
Speed Run
Vue.ai’s speed-run mode enables faster workflow development by executing workflows on sample data. Users can quickly test and verify parts or all of a workflow on sampled data, reducing development cycles and increasing productivity.
Sample Data
Datasets can be sampled for preview and used in speed-run mode, with configurable options for row count and sampling methods:
Sampling Method
- Random
- From Beginning
- From End
Deploy Workflow
Schedule
Set time-based triggers to run workflows at specified intervals.
Run Workflow
Once deployed, workflows can be triggered manually or based on a schedule, with each run recorded as a job. Jobs provide insights into execution and progress of each workflow instance. Configuration changes in workflows apply to newly scheduled jobs.
BYO Nodes
Node Types / Code Nodes
Create custom nodes with your preferred runtime and engine, define schemas, organize nodes into groups for governance, configure deployments, and continuously optimize for performance and cost.
How to Add and Use a Node in the Workflow?
To add a node to the workflow, click the ">" button. This expands a panel listing all the nodes that can be used.
To create a new custom node, click Add Node. A small window prompts you to name the custom node and define its type; the corresponding node environment then opens within the workflow you are currently using.
After creating custom configurable nodes, click the Refresh button next to Nodes to update the panel with the new entries.
To see the nodes available under a Node Type, click the "v" button to expand that type; all nodes of that type are listed and can be added to the workflow.
Once you have chosen a node, drag and drop it onto the workflow canvas.
A warning symbol appears over the node until the necessary fields on the node are filled in and saved.
Example: Dataset Reader - the node can be renamed, and a dataset has to be added to the node and saved.
Code Server
Vue.ai’s code server facilitates custom node development by providing a familiar IDE environment for coding, complete with access to preferred tools and libraries. This setup supports an easily adoptable development process, allowing users to seamlessly create custom nodes.
Developer Hub
Equip data scientists and engineers with cutting-edge notebooks and MLOps solutions, streamlining development workflows for scalable, production-ready applications in advanced data science and machine learning operations.
MLOps
This section outlines the foundational concepts of MLOps, focusing on Experiments and Models, and their importance in machine learning workflows.
Experiments
An experiment groups together models aimed at solving a specific problem. Experiments involve the systematic process of developing and refining machine learning models to achieve specific outcomes.
Why Do We Need to Group?
- Multiple Configurations: Manage different model configurations such as hyperparameters, algorithms, and preprocessing techniques.
- Version Control of Models and Data: Track different versions of models and the data used in training them, ensuring reproducibility and the ability to roll back to previous versions.
- Compare Different Models: Easily compare various models built for a specific task to identify the best-performing one.
- Deploy the Best Model in Production: Streamline the process of selecting and deploying the optimal model to production.
- Tagging: Tag the best model under each experiment to facilitate traceability and reproducibility.
Models
Models in MLflow represent the key entity for managing machine learning models and their associated metadata.
Functions of Models
- Model and Data Versioning: Maintain versions of models and the data they were trained on, ensuring consistent and reproducible experiments.
- Storing Artifacts: Store model files and preprocessing artifacts, enabling easy retrieval and deployment of models along with their required preprocessing steps.
- Tracking Metadata: Capture metadata associated with models, such as hyperparameters, metrics, and the environment in which the model was trained.
- Model Registry: Provide a centralized repository for registering, organizing, and managing models, allowing for collaboration and versioning.
- Model Deployment: Facilitate model deployment by packaging models with their required dependencies and enabling seamless integration with deployment platforms.
Experiment Tracking Flow
This section outlines the systematic approach for managing experiments, registering models, and handling inference tasks in a machine learning pipeline.
Experiment Initialization and Artifact Management
When an experiment is created through the API, several actions take place to set up the necessary infrastructure for tracking and managing models and artifacts.
Experiment Registration
The experiment is registered in the ML Client, which acts as a centralized container for all models and their associated metadata. This structure enables easy tracking and management of multiple models over time.
Artifact Folder Creation
A dedicated folder is created to store all artifacts associated with the experiment. This folder typically includes:
- Preprocessor and model objects
- Feature interpretability plots and tables
- Model performance metrics and scores
- Sample datasets for validation and testing
By organizing artifacts in this way, users can efficiently manage all components related to a particular experiment.
Tracking Preprocessing Artifacts and Model Objects during Model Creation
When the user passes a model object or a path pointing to the trained model, MLflow creates a library-agnostic copy of the model for inference. This ensures uniformity across deployment and prediction tasks, regardless of the machine learning library used for training.
MLflow Model Flavors
MLflow supports various model flavors to handle nuances of different libraries while providing a consistent interface for users. These include:
- scikit-learn
- XGBoost
- TensorFlow
- PyTorch
- Statsmodels
Components of an MLflow Model
- MLmodel file: A configuration file specifying how to load and use the model. It includes metadata about the model's flavor and the paths to necessary files.
- model.pkl: A serialized file containing the trained model's weights, essential for making predictions.
- Environment Files: These include conda.yaml, requirements.txt, and python_env.yaml, which specify dependencies for running the model in a consistent and replicable environment.
Artifact Storage
- Polycloud Support: Integration with multiple cloud storage providers (e.g., S3, Azure, GCP) ensures flexibility and compatibility with MLflow.
- Base Artifact Location: During application startup, a base artifact location is initialized with the following directory structure:
EXPERIMENT_ARTIFACT_LOCATION / ENV / CLIENT / EXPERIMENT_NAME / ml_client_model_id
Here, ml_client_model_id is generated at the time of model creation, providing a unique identifier for each model's artifacts. This systematic organization facilitates efficient management of experiment-related artifacts.
Performance Evaluation through MLflow
The MLOps APIs support tracking training and validation metrics throughout the machine learning lifecycle. Key features include:
- Monitoring performance indicators during training and validation phases.
- Tracking model parameters such as hyperparameters and architecture configurations.
This capability enables users to compare models side-by-side based on their parameters and performance metrics, streamlining the process of selecting the best-performing model for deployment.
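A minimal MLflow tracking sketch of this flow is shown below, using the standard MLflow API (mlflow.set_experiment, mlflow.start_run, mlflow.log_params, mlflow.log_metric, mlflow.sklearn.log_model); the experiment name, model, and metric are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

mlflow.set_experiment("churn-prediction")  # the experiment groups related runs and models

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)                                              # hyperparameters
    mlflow.log_metric("val_accuracy", accuracy_score(y_val, model.predict(X_val)))
    mlflow.sklearn.log_model(model, artifact_path="model")                 # scikit-learn flavor
```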
Inference with Preprocessing Pipeline
After identifying the best-performing model, it is utilized for inference tasks. Along with the model, the preprocessing pipeline used during training is saved. Key aspects include:
- Ensuring new data undergoes the same transformations applied during training.
- Maintaining the sequence of preprocessing steps to guarantee consistent and accurate predictions.
By preserving the preprocessing pipeline, the system ensures data integrity and reliability during inference, providing seamless transitions from training to production environments.
Notebooks
Notebooks provides a centralized, scalable, and collaborative environment for data scientists, analysts, and researchers. It enables teams to access notebooks in a shared infrastructure, making it easier to perform exploratory data analysis, interactive computing, and computational storytelling.
What is a Notebook?
A Notebook is an interactive computing environment that allows users to combine code, markdown text, visualizations, and equations in a single document. It is widely used for:
- Data exploration and visualization
- Machine learning prototyping
- Scientific computing and simulations
- Collaborative research and reporting
Core Features
- Code Execution: Supports multiple programming languages, including Python, R, and Spark
- Rich Outputs: Inline visualizations, tables, and interactive widgets
- Markdown Support: Enables formatted text, LaTeX equations, and embedded media
- Reproducibility: Notebooks can be versioned, shared, and rerun with different datasets
Benefits and Advantages
Notebooks extends Jupyter notebooks by providing a multi-user environment where users can work in a scalable and managed infrastructure.
Benefits Over Local IDEs
Feature | Notebooks | Local IDE |
---|---|---|
Collaboration | Shared workspace, multiple users | Limited, requires additional tools |
Scalability | Can run on clusters, cloud, or Kubernetes | Limited to local machine resources |
Resource Management | Centralized administration of compute resources | Manual management of dependencies |
Notebook-based Workflow | Interactive and visual | Script-based, requires additional setup |
Reproducibility | Ensures uniform execution environments | Can vary due to local setups |
Security | Centralized authentication, role-based access | Dependent on local configurations |
SDK Integration | Direct connectivity with platform for data workflows | Requires additional setup |
Advantages for Data Processing & Analysis
- Consistent Execution Environment: Eliminates dependency issues across multiple users
- On-demand Compute Scaling: Run resource-intensive tasks on cloud or clusters
- Interactive Data Exploration: Enables real-time visualization and computation
- Secure Access to Data: Centralized authentication and role-based access
- Streamlined Reproducibility: Ensures that notebooks can be shared and rerun consistently
- SDK Integration: Vue.ai platform SDK can be used within notebooks to connect with the platform, build narratives, and leverage data more effectively
Architecture Components
Key Components
- Hub: The central service managing user authentication and notebook servers
- Proxy: Routes user requests to the appropriate notebook servers
- Spawner: Starts and manages individual Notebook instances for users
Additional Information
For further information on the key components, please visit the JupyterHub documentation.
Use Cases
- Enterprise Data Science: Large teams can collaborate with managed compute resources
- Cloud-based Machine Learning: Train and deploy ML models with scalable infrastructure
- Big Data Analytics: Process and visualize large datasets interactively
- SDK-Powered Data Workflows: Leverage Vue.ai SDK within notebooks for enhanced data processing and insights
Summary
Notebooks provides a scalable, collaborative, and efficient environment for data professionals, eliminating many of the constraints of traditional IDEs and local development setups. With its ability to manage multiple users, allocate resources dynamically, and support interactive data analysis, it is a powerful solution for modern data science workflows. Vue.ai SDK further enhances notebooks by enabling direct connectivity with the platform for data exploration and narrative building.
Customer Hub
The Customer Hub provides powerful tools for creating targeted audiences, managing personalized content, and analyzing customer behavior across digital touchpoints.
Audience
Overview
An audience refers to a group of users who share similar characteristics, interests, or behaviors. Specific strategies can be applied to target these audiences based on their traits and actions.
Examples
- Users who added an iPhone to their cart and purchased it within 2 days.
- Users in the top 1 percentile of affinity to the apparel category.
Features
A feature is a dimension or attribute that describes a user. Audiences can be defined by applying filter conditions to features, which can include demographic information, interests, and behaviors.
Examples
- Gender of the user.
- Number of visits (#Visits).
- Number of purchases (#Buys).
- Affinity to the electronics category.
Feature Groups
A feature group is a logical bundling of related features. For example, the features "Gender" and "Age" may fall under the "Demographics" feature group. Feature groups can also be organized at the brand level.
Examples
- Tatacliq Buys - Last 30 Days describes the number of purchases made on Tatacliq in the past 30 days, and belongs to the "Tatacliq" feature group.
- Croma Buys - Last 30 Days describes the number of purchases made on Croma in the past 30 days, and belongs to the "Croma" feature group.
Sequences
A sequence refers to a specific series of user interactions. An audience can be defined by identifying users who have performed certain sequences in their history.
Examples
- Users who added a Samsung phone to their cart and purchased it within 1 day.
- Users who added an Apple Watch to their cart but did not purchase it within 2 days.
Metrics
Audience metrics measure the performance of defined audiences based on business goals. Metrics include key performance indicators such as conversion rate, revenue, and engagement.
Conditions and Operators
Conditions, Operators, and Values
A condition defines an audience by specifying a feature, an operator, and one or more values.
Example
- Condition: Brand == Nike
- Feature: Brand
- Operator: ==
- Value: Nike
Other examples
- Number of Buys > 3.
- Average Order Value > $50.
Boolean Operators
Boolean operators combine multiple conditions or rules and return a true/false result based on the logical relationships between them.
- AND: Returns true if all conditions are met, false otherwise.
- OR: Returns true if at least one condition is met, false otherwise.
Time Operators
Time operators define time intervals between two events in a sequence.
- None: No time interval between events.
- After & Within: Counts visitors who meet the condition within a specified duration after an event.
- After: Counts visitors who meet the condition after a specified duration.
- Within: Counts visitors who meet the condition within a specified duration.
Logical Operators
Logical operators are used to compare values of features when defining conditions.
- ==: Exactly equal to the given value.
- >: Greater than the given value.
- >=: Greater than or equal to the given value.
- <: Less than the given value.
- <=: Less than or equal to the given value.
- !=: Not equal to the given value.
- IN: Checks if the visitor exists within a given set of values.
- ~: Approximately equal to the given value.
Rules and Groups
Rules
A rule consists of one or more conditions. Each rule is separated by either an AND or OR operator and is used to define an audience.
Groups
A group contains one or more rules. There are two types of groups: Condition Groups and Sequence Groups.
Condition Group
A condition group consists of one or more rules, with each rule defined at the same level. By default, one condition group is present, but users can add any number of additional condition groups.
Sequence Group
A sequence group consists of one or more sequentially executed rules, each representing an event with associated attributes. Time intervals between rules can be defined, and users can add any number of sequence groups.
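Putting these pieces together, an audience combining a condition group and a sequence group might be represented as follows; the structure and field names are illustrative assumptions, not the platform's actual schema:

```python
# Illustrative representation of an audience definition; the structure and
# field names are assumptions for the sketch, not the platform's schema.
audience = {
    "name": "High-intent Nike shoppers",
    "condition_groups": [
        {
            "operator": "AND",
            "rules": [
                {"feature": "Brand", "operator": "==", "value": "Nike"},
                {"feature": "#Buys", "operator": ">", "value": 3},
                {"feature": "Average Order Value", "operator": ">", "value": 50},
            ],
        }
    ],
    "sequence_groups": [
        {
            "rules": [
                {"event": "add_to_cart", "attributes": {"category": "Shoes"}},
                {"event": "purchase", "time_operator": "Within", "duration_days": 2},
            ]
        }
    ],
}
```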