Core Concepts
Data Hub
Seamlessly integrate and unify your enterprise data.
- Centralize data management for enhanced operational efficiency.
- Effortlessly upload, register, and organize documents at scale.
- Unlock insights with robust business intelligence reporting tools.
Connection Manager
A Connection refers to an automated data pipeline that replicates data from a source to a destination. It establishes a link between a configured source (using a source connector) and a configured destination (using a destination connector) to enable data synchronization.
Sources
A source refers to any system from which data is ingested, such as an API, file, database, or data warehouse.
- Always test your connection before proceeding
- Ensure proper credentials are provided
- Verify data access permissions
Overview
A source refers to any system from which data is ingested, such as an API, file, database, or data warehouse. Setting up a source involves configuring the necessary variables that allow the connector to access and retrieve data. The specific configuration fields may vary depending on the type of connector but typically include credentials for authentication (e.g., username and password, API key), as well as parameters that determine which data to extract. For example, this may involve specifying a start date for syncing records or defining a search query to match the records.
Source Definition
The definition of a source depends on the type of system—whether it's an API, database, file, or data warehouse—and the parameters required for secure connection or authentication. The configuration fields are determined by the connector type and the security protocols necessary to access the data.
Test Connection
After providing the source configuration details, the Test Connection function is used to validate whether the authentication information is correct. It also verifies that the system can successfully connect to the source using the supplied credentials.
Library
Below are examples of top sources that the Vue.ai platform supports for integration:
- Google Sheets
- PostgreSQL
- Amazon Redshift
- HubSpot
- Shopify
CDK
If the desired source is not available in the list of supported connectors, users can utilize the Connector Development Kit (CDK) to build a custom connector. This kit enables the creation of a custom Python-based source to integrate with the system.
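The exact interface of the CDK is defined by its own documentation; as a rough illustration only, a custom Python source typically needs to (a) validate credentials and (b) read records. The class name, method names, and endpoint below are hypothetical placeholders, not the documented CDK API:

```python
from typing import Any, Dict, Iterable

import requests


class CustomSource:
    """Illustrative skeleton of a Python source connector (names are hypothetical)."""

    def __init__(self, config: Dict[str, Any]):
        self.api_key = config["api_key"]            # credential from the source configuration
        self.start_date = config.get("start_date")  # optional parameter controlling which data to extract

    def check_connection(self) -> bool:
        """Validate the credentials, mirroring the Test Connection step."""
        resp = requests.get(
            "https://api.example.com/ping",
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=10,
        )
        return resp.status_code == 200

    def read_records(self) -> Iterable[Dict[str, Any]]:
        """Yield records from the source, optionally filtered by start_date."""
        params = {"since": self.start_date} if self.start_date else {}
        resp = requests.get(
            "https://api.example.com/records",
            headers={"Authorization": f"Bearer {self.api_key}"},
            params=params,
            timeout=30,
        )
        resp.raise_for_status()
        yield from resp.json()["records"]
```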
Destinations
Overview
A destination refers to the target system where ingested data is loaded, such as a data warehouse, data lake, database, analytics tool, or the Vue.ai platform.
Destination Definition
The definition of a destination depends on the type of system—whether it's a data warehouse, data lake, database, analytics tool, or the Vue.ai platform. It involves configuring the necessary parameters to establish a connection for loading ingested data into the target system. The specific configuration fields vary based on the connector type and the security protocols required to ensure secure data transfer.
Vue Dataset
The Vue Dataset acts as a destination for loading data into the Vue.ai platform's data catalog. It is used for any data that needs to undergo workflow processes or report generation within the platform. By using Vue Dataset as a destination, users can fully leverage the data analysis and management features of the Vue.ai platform.
Test Connection
Once the destination configuration details are provided, the Test Connection function is used to verify the accuracy of the authentication details and confirm that the system can successfully connect to the destination. This step ensures that data can be seamlessly loaded into the target system without any issues.
Library
Below are examples of top destinations that the Vue.ai platform supports for integration:
- PostgreSQL
- Amazon Redshift
- Vue Dataset
CDK
If the desired destination is not listed among the supported connectors, users can leverage the Connector Development Kit (CDK) to create a custom connector. This enables the development of a tailored destination, allowing seamless integration with the chosen system.
Connections
Overview
A Connection refers to an automated data pipeline that replicates data from a source to a destination. It establishes a link between a configured source (using a source connector) and a configured destination (using a destination connector) to enable data synchronization. The connection defines essential parameters, including the frequency of replication (e.g., hourly, daily, or manual) and the specific data streams to be replicated.
Stream
A stream represents a collection of related records. Depending on the destination, it may be referred to as a table, file, or blob. The term "stream" is used to generalize the flow of data across different destinations.
Examples of Streams:
- A table in a relational database
- A resource or API endpoint in a REST API
- Records from a directory containing multiple files in a filesystem
Record
A record is an individual entry or unit of data, often referred to as a "row." Each record is typically unique and encapsulates information related to a specific entity, such as a customer or transaction.
Examples of Records:
- A row in a relational database table
- A line within a data file
- A data unit retrieved from an API response
Batch
A batch refers to a group of records processed and transferred together as a single unit. Batching is commonly used to efficiently transfer large volumes of data instead of processing records individually.
Examples of Batches:
- A collection of rows in a relational database that are updated simultaneously
- A set of files transferred together during a data migration
- Multiple data entries sent in a single API request
Cursor Field
A cursor field is a specific attribute of a record within a stream. When the source sync mode is configured for incremental updates, the cursor field is used to track which records have already been replicated, so that only records added or updated since the last sync are retrieved (for example, an updated-at timestamp or an incrementing ID).
Examples of Fields:
- A column within a relational database table
- An attribute within an API response
Sync Schedule
There are three methods available for scheduling synchronization:
- Scheduled: This option allows you to set sync intervals, such as every 24 hours or every 2 hours.
- CRON Schedule: For more advanced scheduling, you can use a CRON expression to define specific timing for sync operations.
- Manual: You can initiate a sync manually by clicking the "Sync Now" button in the user interface or through the API.
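As an illustration of the scheduling options, the snippet below shows a standard five-field CRON expression and a hedged sketch of triggering a manual sync over HTTP. The endpoint, payload, and token shown are placeholders, not the platform's documented API:

```python
import requests

# A standard five-field cron expression: minute hour day-of-month month day-of-week.
# "0 2 * * *" would run the sync every day at 02:00.
DAILY_AT_2AM = "0 2 * * *"

# Hypothetical manual-sync call; replace the URL, connection ID, and token with real values.
response = requests.post(
    "https://platform.example.com/api/connections/<connection_id>/sync",
    headers={"Authorization": "Bearer <api_token>"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g., the job ID and status of the triggered sync
```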
Change Data Capture
Change Data Capture (CDC) is the process of capturing and tracking changes made to a database, including inserts, updates, and deletes. CDC ensures that the target system stays synchronized with the source database in real-time, which is essential for data warehousing, business intelligence, and replication scenarios.
Sync Modes
Sync modes define how data is retrieved from a source and transferred to a destination. Vue.ai offers several sync modes to meet different objectives, each impacting how synchronization occurs and whether duplicate records will be generated in the destination.
Sync modes are defined by two components: Source Sync Mode and Destination Sync Mode.
Source Sync Mode
This part describes how data is read from the source:
Mode | Description |
---|---|
Incremental | Reads only the records added since the last sync. The first sync acts as a Full Refresh. |
Incremental (Cursor-based) | Uses a cursor field to track the last processed record, allowing only new records to be retrieved on subsequent syncs. |
Incremental (CDC) | Captures changes in real time; supported by some sources. For more details, refer to the CDC documentation. |
Full Refresh | Reads all records from the source, regardless of previous syncs. |
Destination Sync Mode
This part specifies how data is written to the destination:
Mode | Description |
---|---|
Overwrite | Replaces existing data in the destination with new data. |
Append | Adds new data to existing tables without altering any pre-existing records. |
Append Deduped | Appends data to existing tables while keeping a history of changes. The final table is de-duplicated using a primary key. |
Overwrite Deduped | Replaces existing data and removes duplicates from the final dataset, ensuring uniqueness with a primary key. |
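To make the destination modes concrete, here is a minimal pandas sketch (pandas being one of the platform's built-in engines) contrasting Append with Append Deduped on a primary key; it is a conceptual illustration rather than the platform's actual implementation:

```python
import pandas as pd

# Existing data already in the destination table.
existing = pd.DataFrame(
    [{"id": 1, "status": "pending"}, {"id": 2, "status": "shipped"}]
)

# Newly synced records; id 2 has been updated at the source.
incoming = pd.DataFrame(
    [{"id": 2, "status": "delivered"}, {"id": 3, "status": "pending"}]
)

# Append: keep everything, including the older version of id 2.
appended = pd.concat([existing, incoming], ignore_index=True)

# Append Deduped: keep history, then de-duplicate on the primary key,
# retaining the most recent version of each record.
append_deduped = (
    pd.concat([existing, incoming], ignore_index=True)
    .drop_duplicates(subset="id", keep="last")
)

print(append_deduped)
#    id     status
# 0   1    pending
# 2   2  delivered
# 3   3    pending
```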
Destination Stream Name
Available exclusively for the Vue Dataset destination, this feature represents the dataset where you intend to load your data for integration into the system. Users can either use an existing dataset with a matching schema or create a new dataset to load the data.
Validation Reports
After each successful sync run, a validation report is generated. It includes key metrics and definitions:
Key | Description |
---|---|
attempt | Indicates the number of the current sync attempt. |
bytesSynced | The total number of bytes that were successfully synced during the attempt. |
recordsSynced | The total number of records that were successfully synced. |
totalStats | An object containing aggregate statistics for the sync attempt. |
recordsEmitted | The total number of records emitted (processed) during the sync. |
bytesEmitted | The total number of bytes emitted during the sync. |
stateMessagesEmitted | The number of state messages emitted during the sync process. |
recordsCommitted | The total number of records that were successfully committed to the destination. |
streamStats | An array that provides detailed statistics for individual data streams that were processed during the sync. |
Each object within the streamStats array includes:
Key | Description |
---|---|
streamName | The name of the data stream being reported on. |
stats | An object containing statistics specific to the stream. |
recordsEmitted | The number of records emitted for this specific stream. |
bytesEmitted | The total bytes emitted for this stream. |
recordsCommitted | The number of records that were successfully committed for this stream. |
failureSummary | Provides information about any failures that occurred during the sync attempt. A value of null indicates that there were no failures in this attempt. |
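For orientation, a validation report shaped by the keys above might look like the following Python dictionary; the exact payload produced after a sync may differ in detail:

```python
# Illustrative shape of a validation report using the keys defined above;
# the actual report produced by the platform may differ in detail.
validation_report = {
    "attempt": 1,
    "bytesSynced": 10_485_760,
    "recordsSynced": 25_000,
    "totalStats": {
        "recordsEmitted": 25_000,
        "bytesEmitted": 10_485_760,
        "stateMessagesEmitted": 5,
        "recordsCommitted": 25_000,
    },
    "streamStats": [
        {
            "streamName": "orders",
            "stats": {
                "recordsEmitted": 15_000,
                "bytesEmitted": 6_291_456,
                "recordsCommitted": 15_000,
            },
        },
        {
            "streamName": "customers",
            "stats": {
                "recordsEmitted": 10_000,
                "bytesEmitted": 4_194_304,
                "recordsCommitted": 10_000,
            },
        },
    ],
    "failureSummary": None,  # None (null) means the attempt had no failures
}
```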
Document Manager
The Document Manager provides comprehensive capabilities for intelligent document processing, from defining document types and taxonomies to executing complex extraction workflows.
Document Types
Intelligent Document Processing (IDP) in the Document Manager automates the extraction of specific information from documents. This is achieved by creating and managing reusable templates that teach the AI what information to look for and where to find it. The entire system is built upon two fundamental concepts: the Document Type and its corresponding Taxonomy.
A Document Type is a reusable blueprint or model that defines a specific category of document. Think of it as a template for processing "Invoices," "US Driver's Licenses," or "Bank Statements." By creating a distinct Document Type for each, you tell the system how to handle them individually.
Each Document Type is defined by two key aspects: its structure (the physical layout) and its taxonomy (the data to be extracted).
Document Structure and Handling
This setting informs the AI model about the document's structural consistency and any required pre-processing steps. Choosing the correct option is crucial because it helps the model decide which cues (visual or semantic) to prioritize, leading to higher accuracy.
Layout Categories
These categories describe the inherent structure of the document itself.
Structured
- Definition: These documents have a fixed, unchanging format where data fields appear in the same position on every instance of the document.
- Examples: Government forms (like a W-9), application forms, passports.
- Why It Matters: For structured documents, the AI model heavily relies on spatial cues (the physical location of text). Once it learns that the "Date of Birth" is in a specific spot, it will look there first on all subsequent documents.
Semi-structured
- Definition: These documents contain a similar set of information, but the layout can vary from one instance to another. They have a predictable structure but not a fixed format.
- Examples: Invoices (different vendors have different templates), purchase orders, receipts, and bank statements. The Invoice Number will always be present, but its location can change.
- Why It Matters: The AI model uses a combination of spatial cues and semantic understanding. It knows to look for a field labeled "Invoice #" or a value that looks like an invoice number, regardless of its exact position.
Unstructured
- Definition: These documents have no predefined format or consistent layout. The required information is embedded within free-flowing text.
- Examples: Legal contracts, emails, business reports, and press releases.
- Why It Matters: The model relies almost entirely on semantic relations and context to find the relevant information. It understands language to identify a "contract start date" or "termination clause" based on the surrounding sentences.
Specialized Document Handlers
These are specialized pre-processing workflows that handle common, complex scenarios before the extraction logic is applied.
ID Cards
- What it does: This option is designed for images or pages containing one or more ID cards. The system first runs a detection model to find and crop each individual card.
- Why it matters: It automatically isolates each card, treating it as a separate page for extraction. This is essential when a single scanned image contains multiple IDs, ensuring that data from one card isn't confused with another.
Doc Detect
- What it does: Ideal for multi-document PDFs where different document types are combined into one file (e.g., an application packet containing an application form, a driver's license, and a bank statement). This feature allows you to classify each page or group of pages.
- Why it matters: It enables the system to split a single file into logical sub-documents and then apply the correct Document Type extraction logic to each distinct section, automating complex document separation tasks.
Bank Statement
- This is a pre-configured template optimized for the semi-structured nature of bank statements, providing a head start on building the taxonomy for this common document type.
How Document Types Link to Extraction
Registering a Document Type is the training step. Once a Document Type is finalized and Registered, it becomes an active model. You can then upload new documents and assign them to that type. The system will apply the learned layout rules and taxonomy to automatically extract the specified data fields from the new document.
Taxonomy Overview
The Taxonomy is the heart of a Document Type. It is the complete, structured list of all the data fields (or attributes) that you want to extract from that document.
Attribute Properties
Each attribute in the taxonomy is defined by a set of properties that control the extraction process.
- Name: The unique, user-friendly name for the data field (e.g., Date of Expiry).
- Annotation: The visual link between the attribute and its location on the example document, created by drawing a bounding box.
- Type: The data type of the attribute. Specifying the correct type enables data validation and specific formatting rules. Common types include:
- Alpha Numeric, Barcode, Checkbox, Date, Enum, Free Form Text, Name, Numeric, Signature.
- Table: A special, complex type for extracting structured data from grids or tables.
- Enable Redaction: A security feature for masking personally identifiable information (PII) or other sensitive data.
- Tags: Keywords or labels you can assign to an attribute for organization and filtering.
- Description: A brief, plain-language explanation of what the attribute represents.
- Instruction: Critical context or hints that guide both the AI model and human reviewers. For the model, it acts as a prompt to disambiguate information.
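As a rough illustration, the properties above could be captured in a structure like the one below. The field names and the coordinate format of the annotation are assumptions for the sketch, not the platform's internal schema:

```python
# Illustrative representation of a taxonomy attribute; field names are assumptions.
date_of_expiry = {
    "name": "Date of Expiry",
    "type": "Date",
    "enable_redaction": False,
    "tags": ["identity", "validity"],
    "description": "The date on which the ID card ceases to be valid.",
    "instruction": "Prefer the date labelled 'EXP' or 'Expiry'; ignore the issue date.",
    # The annotation is created visually by drawing a bounding box on the sample
    # document; normalized page coordinates are shown here only for illustration.
    "annotation": {"page": 1, "bbox": [0.62, 0.41, 0.85, 0.46]},
}
```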
The Table Attribute Type
Extracting data from tables is a common requirement and is treated as a special attribute type. A Table attribute is not a single value but a collection of rows and columns, and it comes with its own powerful configuration interface.
Annotating a Table:
The process of annotating a table is more interactive than for simple fields:
- Initial Setup: In the attribute properties pane, you provide an initial estimate for the number of Columns and Rows in the table. You also specify if the First row is header.
- Bounding Box: You draw a single bounding box that encompasses the entire table area on the document.
- Grid Adjustment: The system overlays a grid based on your initial setup. You can then interactively drag the horizontal and vertical grid lines to precisely align them with the cell borders of the table in the document, ensuring perfect cell detection.
Configuring the Table Schema:
After annotating the table's location, you must define its internal structure by clicking Configure Columns. This opens a new view where you define the schema for the data to be extracted. For each column you want to capture, you can configure the following properties:
- Header: The standardized, canonical name you want to use for this column in your final data output (e.g., date_of_birth, item_sku). This ensures a consistent schema regardless of how the header is written in the document.
- Alias: A list of possible header names that might appear in different versions of the document. This is a powerful mapping tool. For example, your standardized Header might be person_name, but the Aliases could include "Name of Person," "Full Name," and "Applicant Name." The system will recognize any of these aliases and map the data to the correct standardized header.
- Strict Matching: A toggle that controls matching behavior.
  - Enabled: The system will only extract data for this column if the header in the document is an exact match to one of the specified aliases.
  - Disabled: The system can use more flexible semantic matching to identify the column, even if the header text doesn't match an alias perfectly.
- Data Type: Just like regular attributes, you can assign a specific data type (Numeric, Date, Text, etc.) to each individual column. This enables validation and formatting at the column level.
- Description & Instruction: Field-level guidance for each specific column, providing context for both the AI model and any human reviewers.
By defining a table schema, you transform messy, variable table data into a clean, structured, and predictable JSON output (typically an array of objects) ready for use in downstream systems.
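For example, a line-items table whose schema standardizes the headers item_sku, description, and amount might yield output like this; the exact output shape is an assumption for illustration:

```python
# Illustrative extraction output for a table schema with standardized headers
# "item_sku", "description", and "amount"; the real output shape may differ.
line_items = [
    {"item_sku": "SKU-1042", "description": "USB-C cable", "amount": 9.99},
    {"item_sku": "SKU-2210", "description": "Wireless mouse", "amount": 24.50},
    {"item_sku": "SKU-3305", "description": "Laptop stand", "amount": 39.00},
]

# Downstream systems can consume the structured rows directly.
total = sum(row["amount"] for row in line_items)
print(total)  # 73.49
```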
Key Processes and Statuses
0-Shot Extraction
When you first upload a sample document, the system performs a 0-shot extraction—an automated pass to identify potential fields. This gives you a pre-populated taxonomy as a starting point to refine.
Document Status: Draft vs. Registered
- Draft: The "in-progress" state for building and refining a Document Type.
- Registered: The final, "live" state. A registered model is ready to be used for processing new documents.
Taxonomy
The taxonomy defines the full set of attributes to be extracted from the document. It includes the following components:
- Attribute Name: Specifies the unique name of each attribute.
- Data Type: Indicates the type of data associated with the attribute (e.g., text, number, date).
- Formatting Configuration: Outlines any specific formatting required for the attribute.
Currently, the taxonomy supports a single-level hierarchy, except in cases involving tables. For multi-page tables, a hierarchical structure applies: the top level displays the merged table, while the underlying individual tables are indented one level below.
Taxonomy Actions
Add New Attribute
To add a new attribute to the taxonomy:
- Click + Add New.
- Enter a name for the attribute.
- Specify the attribute's data type.
- Annotate, if necessary.
- Click Save.
Delete Attribute
To delete an existing attribute:
- Hover over the attribute you wish to delete.
- Open the context menu and select Delete.
- Confirm the deletion action when prompted.
Multi-Select Attributes
For bulk actions on multiple attributes, you can use the multi-select feature. Currently, the primary bulk action supported is deletion.
- To select all attributes, click Select All. Alternatively, use the checkboxes next to each attribute to manually select or deselect items.
- Once your selection is complete, choose Delete to remove the selected attributes.
Document Extraction
After a Document Type is registered, the platform is ready to process documents. The lifecycle of a document involves several key stages, from ingestion and extraction to data storage, post-processing, and performance monitoring.
Document Ingestion Methods
There are two primary ways to send documents to the platform for processing:
UI Upload: Users can manually upload individual files or batches of documents directly through the Documents Hub. This is ideal for ad-hoc tasks, bulk processing of historical files, or workflows where human operators are the starting point.
API Integration: For automated, high-volume workflows, documents can be submitted programmatically via a dedicated REST API endpoint. This allows you to integrate the IDP service directly into your existing applications, such as an email intake system, a mobile app for receipt scanning, or a legacy system's document queue.
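As a hedged sketch of the API route, a document submission might look like the following; the endpoint, parameters, and authentication shown are placeholders, and the platform's REST API documentation defines the actual contract:

```python
import requests

# Hypothetical document-submission call; replace the URL, token, and fields
# with the values defined by the platform's REST API documentation.
with open("invoice_0042.pdf", "rb") as f:
    response = requests.post(
        "https://platform.example.com/api/documents",
        headers={"Authorization": "Bearer <api_token>"},
        files={"file": ("invoice_0042.pdf", f, "application/pdf")},
        data={"document_type": "Invoice", "batch_name": "Q4_Invoices"},
        timeout=60,
    )
response.raise_for_status()
print(response.json())  # e.g., the new Document ID and its initial status
```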
The Extraction Workflow
Once a document is ingested, it triggers a workflow in the Automation Hub. This workflow typically consists of:
- Classification (Optional): If you used Auto Classify, the first step is to identify the correct Document Type.
- Extraction: The system applies the taxonomy of the identified Document Type to extract the relevant data fields.
- Post-processing (Optional): After extraction, the data can be passed to subsequent nodes in the workflow for further refinement.
Extensibility with Code Nodes
The true power of the platform lies in its extensibility. You can add custom Code Nodes (e.g., Python scripts) to the workflow after the IDP step. This enables limitless post-processing possibilities, such as:
- Custom Validation: Implementing complex business rules that go beyond standard data types (e.g., "ensure the delivery date is after the order date").
- Data Enrichment: Calling an external API to enrich the extracted data (e.g., using an address to look up its coordinates).
- Custom Formatting: Transforming the data into a specific format required by a downstream system.
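For instance, the custom-validation bullet above ("ensure the delivery date is after the order date") could be implemented in a Python Code Node roughly as follows; the record keys and input/output convention are assumptions for the sketch:

```python
from datetime import date


def validate_order(record: dict) -> dict:
    """Example business rule for a post-processing code node:
    the delivery date must not precede the order date.
    The record keys used here are assumptions for the sketch."""
    order_date = date.fromisoformat(record["order_date"])
    delivery_date = date.fromisoformat(record["delivery_date"])

    record["validation_errors"] = []
    if delivery_date < order_date:
        record["validation_errors"].append("delivery_date is earlier than order_date")
    return record


# Example input, as it might arrive from the IDP extraction step.
extracted = {"order_date": "2024-03-01", "delivery_date": "2024-02-27", "total": 129.90}
print(validate_order(extracted)["validation_errors"])
# ['delivery_date is earlier than order_date']
```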
Human-in-the-Loop: Review and Annotation
After the automated workflow completes, documents are typically set to a Pending status for human review. This "human-in-the-loop" step is crucial for:
- Quality Assurance: Correcting any errors made by the AI.
- Handling Low-Confidence Extractions: Focusing operator attention on fields where the model was uncertain.
- Continuous Learning: Every correction made during annotation serves as feedback that helps retrain and improve the AI models over time.
Data Storage and Organization
Extracted data is not just a temporary result; it is stored in a structured and accessible way within the platform's Data Hub, populating two distinct datasets:
Documents Dataset (Transactional)
- Structure: Each row represents a single uploaded document.
- Columns: Includes metadata like Document ID, Document Name, Status (Pending/Reviewed), Assigned User, Batch Name, and Tags.
- Purpose: Optimized for operational and transactional queries, such as "Find all pending documents assigned to User A" or "Show me all documents from the 'Q4_Invoices' batch."
Document Extraction Dataset (Analytical)
- Structure: Each row represents a single attribute extracted from a document (a "long" format).
- Columns: Includes Document ID, Attribute Name, Extracted Value, Confidence Score, Data Type, etc.
- Purpose: Optimized for analytical queries across many documents. For example, "What is the average 'Total Amount' for all invoices this month?" or "How many driver's licenses expire in the next 90 days?"
This dual-dataset approach provides flexibility, allowing other platform features like Datasets and Workflows to easily query, join, and act upon the extracted information.
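As an illustration of an analytical query against the long-format Document Extraction Dataset, the pandas sketch below answers the "average Total Amount" question; it assumes the dataset has already been loaded into a DataFrame with the column names listed above:

```python
import pandas as pd

# A few rows of a long-format extraction dataset (one row per extracted attribute).
extraction = pd.DataFrame(
    [
        {"Document ID": "doc-1", "Attribute Name": "Total Amount", "Extracted Value": "120.00"},
        {"Document ID": "doc-2", "Attribute Name": "Total Amount", "Extracted Value": "87.50"},
        {"Document ID": "doc-2", "Attribute Name": "Invoice Number", "Extracted Value": "INV-881"},
    ]
)

# "What is the average 'Total Amount' for these invoices?"
totals = extraction.loc[extraction["Attribute Name"] == "Total Amount", "Extracted Value"]
print(totals.astype(float).mean())  # 103.75
```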
Exporting Data and Documents
You can export data directly from the Documents Hub. The export functionality is flexible, allowing you to download:
- Extracted Data: In structured formats like JSON (for hierarchical data) and CSV (for tabular analysis).
- The Documents Themselves: You can download the original files or versions with redaction applied, ensuring that sensitive information is masked before the document leaves the platform.
Performance Monitoring and Metrics
The lifecycle is complete once you monitor the performance of your Document Types. From the Document Type listing page, you can generate and view metrics that provide insight into:
- Usage Metrics: Volume of documents processed, number of pages, processing times, etc.
- Model Performance: Key metrics like field-level accuracy, straight-through processing (STP) rates (i.e., documents requiring no human correction), and the distribution of confidence scores.
This data is essential for understanding your ROI, identifying bottlenecks, and deciding which Document Types may need further refinement.
Dataset Manager
The Dataset Manager provides comprehensive capabilities for organizing, managing, and analyzing structured data within the Vue.ai platform.
Data Catalog
What is a Data Catalog?
A data catalog is a centralized repository that stores metadata about data assets within a platform or organization. It enables centralized access control, auditing, lineage, and data discovery, serving as a comprehensive inventory of all available data sources. The catalog provides detailed information about each dataset, including its structure, location, ownership, usage, and other relevant attributes.
By leveraging a data catalog, users can efficiently discover, understand, access, and manage data, ensuring streamlined operations and enhanced collaboration.
Key Features of a Data Catalog
Centralized Metadata Management
- Stores and manages metadata, such as data definitions, descriptions, schemas, and lineage.
- Fosters collaboration across roles, including data consumers, analysts, data scientists, machine learning engineers, marketing, sales teams, and more.
Data Discovery
- Facilitates the search and exploration of datasets using metadata.
- Enables users to locate relevant datasets from a centralized repository by applying various filters and criteria.
Data Lineage
- Tracks the origin, source, and transformations of data throughout its lifecycle.
- Provides insights into how data flows and is manipulated across different systems and processes.
Data Auditing and Governance
- Captures user-level audit logs that record access to datasets and dataset groups.
- Enforces policies and standards for data usage, security, and compliance within the platform.
Permissions and Collaboration
- Grants users secure access to raw data stored in the cloud via credentials.
- Enables annotation, commenting, and sharing of insights about datasets and dataset groups, enriching metadata and fostering collaboration.
Data Quality Management
- Assesses and monitors the quality of data.
- Measures key metrics such as completeness and consistency at both the dataset and dataset group levels.
Datasets
What is a Dataset?
A dataset is a collection of data, organized and structured according to a predefined schema. This schema enforces a consistent structure, ensuring that the data can be effectively utilized for various purposes.
Datasets serve as cohesive units of information related to a specific purpose and are fundamental to a data catalog.
Common Use Cases of Datasets:
- Transactional purposes: Managing and recording business transactions.
- Processing workloads: Supporting ETL (Extract, Transform, Load) operations or machine learning and data science workflows.
- Analytics and insights: Driving business decisions by analyzing data to solve specific problems.
What is the Difference Between Data and Metadata?
Data: Represents the actual content or raw information stored, processed, or analyzed. It consists of facts, observations, or values in various forms such as text, numbers, images, or videos.
- Example: In a dataset of customer records, data includes attributes like customer name, age, address, and purchase history.
Metadata: Provides context and descriptive information about the data, facilitating its understanding, management, and usage. Metadata does not contain actual data content but describes it.
Types of Metadata:
- Descriptive: Summarizes the data, including titles, tags, and keywords.
- Structural: Defines how data is organized, such as schemas, tables, fields, or columns.
- Administrative: Details ownership, creation/modification history, and access controls.
- Technical: Specifies technical aspects like file format, size, encoding, data source, quality, or lineage.
What is a Row (or Record) in a Dataset?
A row or record in a dataset can contain:
- Unstructured text: Free-form strings.
- Boolean values: True/false.
- Numerical values: Integers or floating-point numbers.
- Categorical variables: Predefined categories represented as strings or Booleans.
- Structured text: JSON objects, arrays, or vectors.
- Cloud object references: Links to files like images, documents, or audio stored in the cloud.
The choice of data type depends on the data's nature, intended use, and storage requirements.
Entities that Define a Dataset
- Dataset Description: A brief overview of the dataset and its content. This can be manually entered or auto-generated using the magic wand feature (🪄).
- Dataset Size and Number of Records: Details the number of rows and the storage size of the dataset.
- Tags: Labels that help in organizing datasets logically, enabling easier search and identification.
- Dataset Schema: An overview of the dataset's composition, including columns and their data types. Column descriptions can also serve as a data dictionary for better understanding.
What is Dataset Profiling?
Dataset profiling examines underlying data and schema details to uncover distribution patterns, outliers, and dependencies. This aids in effective data management.
Schema Details in Profiling:
- Validations
- Memory (MiB)
- Distinct values: Count and percentage of unique values.
- Negative values: Count and percentage of negative values.
- Missing values: Count and percentage of missing values.
- Mean: Average value of the column.
- Min/Max: Minimum and maximum column values.
- Correlation: Relationship with other columns.
- Count: Total column values.
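These statistics map closely to standard DataFrame operations. The sketch below shows how such a profile could be computed with pandas; it is a conceptual illustration, not the platform's profiling implementation:

```python
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 25.5, None, -4.0, 25.5], "visits": [3, 1, 4, 2, 3]})

profile = {
    "count": df["amount"].count(),                  # non-null values in the column
    "distinct_values": df["amount"].nunique(),
    "negative_values": (df["amount"] < 0).sum(),
    "missing_values": df["amount"].isna().sum(),
    "mean": df["amount"].mean(),
    "min": df["amount"].min(),
    "max": df["amount"].max(),
    "memory_mib": df["amount"].memory_usage(deep=True) / (1024 ** 2),
    "correlation_with_visits": df["amount"].corr(df["visits"]),
}
print(profile)
```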
What is Dataset Sampling and Why is it Done?
Sampling involves selecting a representative subset of data from a larger dataset for analysis. Common sampling methods include random selection or choosing top/bottom N records. The sample size is critical to ensure accurate representation.
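The common sampling methods correspond to simple DataFrame operations, as in this pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 1001), "spend": range(1000, 0, -1)})

random_sample = df.sample(n=100, random_state=42)  # random selection
top_sample = df.head(100)                          # top-N records
bottom_sample = df.tail(100)                       # bottom-N records
```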
Metrics
Metrics and reports enable slicing, dicing, analytics, and summarization of datasets. Users can create aggregations or metrics based on dataset columns and visualize them using charts.
Options for Metrics Creation:
- Utilize the platform's built-in intelligence to auto-generate reports.
- Manually create customized reports to meet specific requirements.
Dataset Groups
Dataset groups are organizational units that logically group related datasets to support isolation, management, analysis, and governance. They enable effective organization and management within a data ecosystem, similar to folders in a file system or schemas in a database, facilitating hierarchical organization of datasets.
Dataset groups can isolate datasets relevant to specific use cases, projects, or security boundaries, creating a structure that supports efficient data handling and governance.
Data Relationships
A relationship is a connection or association between two or more datasets, illustrating how they are linked within a dataset group. These relationships are essential for understanding interconnections among datasets, enabling more comprehensive metadata management, analysis, and exploration of the ecosystem.
By defining relationships within a dataset group, users gain insights into data connections, enabling downstream tasks—such as data processing, management, and analysis—that drive business outcomes.
Cardinality
Cardinality defines the number of instances of one dataset that relate to instances of another through a specific relationship. There are three main types of cardinality:
One-to-One (1:1) Cardinality
- Each instance of one entity is associated with exactly one instance of another entity.
One-to-Many (1:N) Cardinality
- Each instance of one entity can relate to multiple instances of another, while each instance of the second entity relates to only one instance of the first.
- The reverse, a many-to-one relationship, can also exist.
Many-to-Many (N:M or N:N) Cardinality
- Multiple instances of one entity relate to multiple instances of another.
- This often requires an intermediary (junction) table to represent the relationship.
Entity-Relationship (ER) Diagrams
Entity-Relationship (ER) diagrams are graphical representations used to model database structures, depicting:
- Entities: Represented as datasets
- Attributes: Characteristics or schema details of the datasets
- Relationships: Connections between datasets within a database, including defined cardinality
ER diagrams serve as a blueprint for understanding dataset structures and relationships, helping stakeholders make informed decisions and align with data requirements.
Purpose of Dataset Groups
Dataset groups serve several purposes:
Logical Organization
- Provides a structure for organizing datasets based on relationships and domains
- Allows efficient navigation and discovery of datasets, particularly in large data ecosystems
Data & Metadata Management
- Facilitates data management by grouping datasets based on shared characteristics
- Enhances metadata for datasets with similar properties
Relationship Management
- Enables understanding of relationships within a dataset group
- Serves as documentation, aiding schema and relationship comprehension for workflows
Permissions
- Simplifies access control and permissions management for datasets within the group
Data Analysis and Exploration
- Allows users to explore patterns and correlations within the group
- Helps stakeholders visualize and solve problems effectively
Data Governance
- Enforces data governance policies, including quality checks, privacy rules, and compliance standards
Collaboration and Sharing
- Creates a shared environment for accessing and working with datasets
Constraints and Behavior
Dataset and Dataset Group Constraints
- A dataset can belong to only one dataset group
- Two datasets within the same group cannot share the same name
Adding Datasets to Groups
With the right permissions, users can add datasets to groups in the following ways:
Defining Relationships
- Users can view all datasets outside the group to select those to add
Direct Addition
- Users can add datasets directly from the dataset group's information page
Creating a Dataset
- New datasets can be added to an existing group (without a predefined relationship)
- Assigned to a new group
- Left ungrouped
Automation Hub
Design advanced analytics and machine learning workflows tailored to your needs.
- Create custom nodes and automate processes for specific problem statements.
- Streamline the design and execution of workflows with advanced automation capabilities, enabling scalable and efficient data and computational processes.
Workflow Concepts
Overview
A workflow is a structured series of tasks executed in a specific order to achieve a particular objective, represented as a Directed Acyclic Graph (DAG). Workflows automate and manage complex processes, offering flexible control over task execution, condition-based decision-making, and support for both automated and human-in-the-loop operations. They can be triggered in various ways, handle both one-time and recurring tasks, and provide real-time monitoring for tracking progress and outcomes.
Vue.ai's Enterprise AI Orchestration Platform supports various types of workflows, including analytical workflows, data science/machine learning (DS/ML) workflows, and business process workflows, each tailored to meet diverse operational needs.
Key Features
No-Code Workflow Building
Empowers users with a no-code, drag-and-drop interface, allowing for effortless creation of analytical, data science, and business process workflows on the Enterprise AI Orchestration Platform.
Polyglot Compute Orchestration
Unlocks the power of polyglot compute with Vue.ai’s platform, enabling users to seamlessly integrate nodes across multiple programming languages and runtimes within a single workflow. This multi-environment orchestration provides flexibility and performance, allowing users to leverage the best tools for each task while maintaining an efficient workflow experience.
Rapid Prototyping
Accelerate development cycles with Vue.ai’s rapid prototyping feature, running workflows in speed-run mode on data samples for quick verification and iteration, enhancing productivity.
Workflow Builder
The Workflow Builder is a low-code, drag-and-drop tool designed to simplify the creation of workflows. It allows users to configure and build complex workflows with ease, offering both form-driven and JSON-driven editors to sequence nodes. Users can search through a library of nodes, including marketplace and custom options, and create custom nodes with their preferred engines, providing flexibility in customization.
Engines
Engines bring polyglot compute power to workflows, allowing nodes to use different programming languages and environments based on task-specific requirements. This versatility enables workflows to incorporate a variety of engines, including Python-based tasks with Pandas, large-scale analytics with Spark, or custom applications in other languages. Supporting multiple languages and runtimes enhances the capability and efficiency of workflows for handling diverse processing needs.
Built-in Presets/Library
Pandas
A built-in engine for data manipulation and analysis, ideal for tasks such as data cleaning, transformation, and exploration on small to medium-sized datasets.
Spark
A distributed data processing engine optimized for large-scale analytics and machine learning tasks, enabling parallel processing across clusters for efficient big data handling.
Node Categories
In the platform UI, a node is the building block of a workflow: it represents an operation, a transformation, or a dataset. Nodes can be interconnected to form a pipeline or workflow that processes data step by step.
Dataset Nodes
Nodes that represent core data entities within a workflow. These serve as input/output datasets that feed data into the pipeline or capture results, linking data sources to the operations performed.
Compute Nodes
Nodes that perform specific computational tasks, such as Optical Character Recognition (OCR), data extraction (e.g., Textract), or machine learning inference. These nodes handle data-intensive processing using underlying engines for efficient task execution.
Connector Nodes
Nodes that enable seamless data integration with external systems, APIs, or data sources. They handle both inbound and outbound data flows, facilitating data movement between the workflow and platforms such as databases, data lakes, or external APIs.
HITL (Human-In-The-Loop) Nodes
Incorporate human decision points within automated workflows, pausing execution for human review or approval before continuing, ideal for tasks requiring manual oversight.
Model Nodes
Nodes that handle model training, deployment, and versioning within a workflow, encapsulating model development and producing artifacts for reuse or deployment, crucial for AI/ML workflows.
Transform Nodes
Apply data transformations or wrangling operations to datasets, such as data cleansing, feature engineering, or reshaping, enabling workflows to process and refine data before passing it to subsequent nodes.
Workflow Nodes
Represent entire workflows within a larger workflow, allowing modular design and reuse. These nodes encapsulate pre-built workflows, enabling complex processes to be nested and reused, enhancing efficiency.
HTTP Nodes
Enable interaction with external services through HTTP requests. These nodes allow workflows to trigger API calls, fetch external data, or send results to third-party services, integrating external web services within workflows.
Speed Run
Vue.ai’s speed-run mode enables faster workflow development by executing workflows on sample data. Users can quickly test and verify parts or all of a workflow on sampled data, reducing development cycles and increasing productivity.
Sample Data
Datasets can be sampled for preview and used in speed-run mode, with configurable options for row count and sampling methods:
Sampling Method
- Random
- From Beginning
- From End
Deploy Workflow
Schedule
Set time-based triggers to run workflows at specified intervals.
Run Workflow
Once deployed, workflows can be triggered manually or based on a schedule, with each run recorded as a job. Jobs provide insights into execution and progress of each workflow instance. Configuration changes in workflows apply to newly scheduled jobs.
BYO Nodes
Node Types / Code Nodes
Create custom nodes with your preferred runtime and engine, define schemas, organize nodes into groups for governance, configure deployments, and continuously optimize for performance and cost.
How to Add and Use a Node in the Workflow?
To add a node to the workflow, click the ">" button. This expands a panel listing all the nodes that can be used.
To create a new custom node, click Add Node. A small window prompts you to name the custom node and define its type; the corresponding node environment then opens within the workflow you are currently using.
After creating custom configurable nodes, click the Refresh button next to Nodes to update the panel with the new entries.
To see the nodes available under a Node Type, click the "v" button to expand that type; all nodes of that type are listed and can be added to the workflow.
Once you have chosen a node, drag and drop it onto the workflow canvas.
A warning symbol appears over the node until the necessary fields on the node are filled in and saved.
Example: Dataset Reader - the node can be renamed, and a dataset has to be added to the node and saved.
Code Server
Vue.ai’s code server facilitates custom node development by providing a familiar IDE environment for coding, complete with access to preferred tools and libraries. This setup supports an easily adoptable development process, allowing users to seamlessly create custom nodes.
Developer Hub
Equip data scientists and engineers with cutting-edge notebooks and MLOps solutions, streamlining development workflows for scalable, production-ready applications in advanced data science and machine learning operations.
MLOps
This section outlines the foundational concepts of MLOps, focusing on Experiments and Models, and their importance in machine learning workflows.
Experiments
An experiment groups together models aimed at solving a specific problem. Experiments involve the systematic process of developing and refining machine learning models to achieve specific outcomes.
Why Do We Need to Group?
- Multiple Configurations: Manage different model configurations such as hyperparameters, algorithms, and preprocessing techniques.
- Version Control of Models and Data: Track different versions of models and the data used in training them, ensuring reproducibility and the ability to roll back to previous versions.
- Compare Different Models: Easily compare various models built for a specific task to identify the best-performing one.
- Deploy the Best Model in Production: Streamline the process of selecting and deploying the optimal model to production.
- Tagging: Tag the best model under each experiment to facilitate traceability and reproducibility.
Models
Models in MLflow represent the key entity for managing machine learning models and their associated metadata.
Functions of Models
- Model and Data Versioning: Maintain versions of models and the data they were trained on, ensuring consistent and reproducible experiments.
- Storing Artifacts: Store model files and preprocessing artifacts, enabling easy retrieval and deployment of models along with their required preprocessing steps.
- Tracking Metadata: Capture metadata associated with models, such as hyperparameters, metrics, and the environment in which the model was trained.
- Model Registry: Provide a centralized repository for registering, organizing, and managing models, allowing for collaboration and versioning.
- Model Deployment: Facilitate model deployment by packaging models with their required dependencies and enabling seamless integration with deployment platforms.
Experiment Tracking Flow
This section outlines the systematic approach for managing experiments, registering models, and handling inference tasks in a machine learning pipeline.
Experiment Initialization and Artifact Management
When an experiment is created through the API, several actions take place to set up the necessary infrastructure for tracking and managing models and artifacts.
Experiment Registration
The experiment is registered in the ML Client, which acts as a centralized container for all models and their associated metadata. This structure enables easy tracking and management of multiple models over time.
Artifact Folder Creation
A dedicated folder is created to store all artifacts associated with the experiment. This folder typically includes:
- Preprocessor and model objects
- Feature interpretability plots and tables
- Model performance metrics and scores
- Sample datasets for validation and testing
By organizing artifacts in this way, users can efficiently manage all components related to a particular experiment.
Tracking Preprocessing Artifacts and Model Objects during Model Creation
When the user passes a model object or a path pointing to the trained model, MLflow creates a library-agnostic copy of the model for inference. This ensures uniformity across deployment and prediction tasks, regardless of the machine learning library used for training.
MLflow Model Flavors
MLflow supports various model flavors to handle nuances of different libraries while providing a consistent interface for users. These include:
- scikit-learn
- XGBoost
- TensorFlow
- PyTorch
- Statsmodels
Components of an MLflow Model
- MLmodel file: A configuration file specifying how to load and use the model. It includes metadata about the model's flavor and the paths to necessary files.
- model.pkl: A serialized file containing the trained model's weights, essential for making predictions.
- Environment Files: These include conda.yaml, requirements.txt, and python_env.yaml, which specify dependencies for running the model in a consistent and replicable environment.
Artifact Storage
- Polycloud Support: Integration with multiple cloud storage providers (e.g., S3, Azure, GCP) ensures flexibility and compatibility with MLflow.
- Base Artifact Location: During application startup, a base artifact location is initialized with the following directory structure:
EXPERIMENT_ARTIFACT_LOCATION / ENV / CLIENT / EXPERIMENT_NAME / ml_client_model_id
Here, ml_client_model_id is generated at the time of model creation, providing a unique identifier for each model's artifacts. This systematic organization facilitates efficient management of experiment-related artifacts.
Performance Evaluation through MLflow
The MLOps APIs support tracking training and validation metrics throughout the machine learning lifecycle. Key features include:
- Monitoring performance indicators during training and validation phases.
- Tracking model parameters such as hyperparameters and architecture configurations.
This capability enables users to compare models side-by-side based on their parameters and performance metrics, streamlining the process of selecting the best-performing model for deployment.
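A minimal MLflow tracking sketch of this flow is shown below, using the standard MLflow API (mlflow.set_experiment, mlflow.start_run, mlflow.log_params, mlflow.log_metric, mlflow.sklearn.log_model); the experiment name, model, and metric are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

mlflow.set_experiment("churn-prediction")  # the experiment groups related runs and models

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)                                              # hyperparameters
    mlflow.log_metric("val_accuracy", accuracy_score(y_val, model.predict(X_val)))
    mlflow.sklearn.log_model(model, artifact_path="model")                 # scikit-learn flavor
```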
Inference with Preprocessing Pipeline
After identifying the best-performing model, it is utilized for inference tasks. Along with the model, the preprocessing pipeline used during training is saved. Key aspects include:
- Ensuring new data undergoes the same transformations applied during training.
- Maintaining the sequence of preprocessing steps to guarantee consistent and accurate predictions.
By preserving the preprocessing pipeline, the system ensures data integrity and reliability during inference, providing seamless transitions from training to production environments.
Notebooks
Notebooks provides a centralized, scalable, and collaborative environment for data scientists, analysts, and researchers. It enables teams to access notebooks in a shared infrastructure, making it easier to perform exploratory data analysis, interactive computing, and computational storytelling.
What is a Notebook?
A Notebook is an interactive computing environment that allows users to combine code, markdown text, visualizations, and equations in a single document. It is widely used for:
- Data exploration and visualization
- Machine learning prototyping
- Scientific computing and simulations
- Collaborative research and reporting
Core Features
- Code Execution: Supports multiple programming languages, including Python, R, and Spark
- Rich Outputs: Inline visualizations, tables, and interactive widgets
- Markdown Support: Enables formatted text, LaTeX equations, and embedded media
- Reproducibility: Notebooks can be versioned, shared, and rerun with different datasets
Benefits and Advantages
Notebooks extends Jupyter notebooks by providing a multi-user environment where users can work in a scalable and managed infrastructure.
Benefits Over Local IDEs
Feature | Notebooks | Local IDE |
---|---|---|
Collaboration | Shared workspace, multiple users | Limited, requires additional tools |
Scalability | Can run on clusters, cloud, or Kubernetes | Limited to local machine resources |
Resource Management | Centralized administration of compute resources | Manual management of dependencies |
Notebook-based Workflow | Interactive and visual | Script-based, requires additional setup |
Reproducibility | Ensures uniform execution environments | Can vary due to local setups |
Security | Centralized authentication, role-based access | Dependent on local configurations |
SDK Integration | Direct connectivity with platform for data workflows | Requires additional setup |
Advantages for Data Processing & Analysis
- Consistent Execution Environment: Eliminates dependency issues across multiple users
- On-demand Compute Scaling: Run resource-intensive tasks on cloud or clusters
- Interactive Data Exploration: Enables real-time visualization and computation
- Secure Access to Data: Centralized authentication and role-based access
- Streamlined Reproducibility: Ensures that notebooks can be shared and rerun consistently
- SDK Integration: Vue.ai platform SDK can be used within notebooks to connect with the platform, build narratives, and leverage data more effectively
Architecture Components
Key Components
- Hub: The central service managing user authentication and notebook servers
- Proxy: Routes user requests to the appropriate notebook servers
- Spawner: Starts and manages individual Notebook instances for users
Additional Information
For further information on the key components, please visit the JupyterHub documentation.
Use Cases
- Enterprise Data Science: Large teams can collaborate with managed compute resources
- Cloud-based Machine Learning: Train and deploy ML models with scalable infrastructure
- Big Data Analytics: Process and visualize large datasets interactively
- SDK-Powered Data Workflows: Leverage Vue.ai SDK within notebooks for enhanced data processing and insights
Summary
Notebooks provides a scalable, collaborative, and efficient environment for data professionals, eliminating many of the constraints of traditional IDEs and local development setups. With its ability to manage multiple users, allocate resources dynamically, and support interactive data analysis, it is a powerful solution for modern data science workflows. Vue.ai SDK further enhances notebooks by enabling direct connectivity with the platform for data exploration and narrative building.
Customer Hub
The Customer Hub provides powerful tools for creating targeted audiences, managing personalized content, and analyzing customer behavior across digital touchpoints.
Audience
Overview
An audience refers to a group of users who share similar characteristics, interests, or behaviors. Specific strategies can be applied to target these audiences based on their traits and actions.
Examples
- Users who added an iPhone to their cart and purchased it within 2 days.
- Users in the top 1 percentile of affinity to the apparel category.
Features
A feature is a dimension or attribute that describes a user. Audiences can be defined by applying filter conditions to features, which can include demographic information, interests, and behaviors.
Examples
- Gender of the user.
- Number of visits (#Visits).
- Number of purchases (#Buys).
- Affinity to the electronics category.
Feature Groups
A feature group is a logical bundling of related features. For example, the features "Gender" and "Age" may fall under the "Demographics" feature group. Feature groups can also be organized at the brand level.
Examples
- Tatacliq Buys - Last 30 Days describes the number of purchases made on Tatacliq in the past 30 days, and belongs to the "Tatacliq" feature group.
- Croma Buys - Last 30 Days describes the number of purchases made on Croma in the past 30 days, and belongs to the "Croma" feature group.
Sequences
A sequence refers to a specific series of user interactions. An audience can be defined by identifying users who have performed certain sequences in their history.
Examples
- Users who added a Samsung phone to their cart and purchased it within 1 day.
- Users who added an Apple Watch to their cart but did not purchase it within 2 days.
Metrics
Audience metrics measure the performance of defined audiences based on business goals. Metrics include key performance indicators such as conversion rate, revenue, and engagement.
Conditions and Operators
Conditions, Operators, and Values
A condition defines an audience by specifying a feature, an operator, and one or more values.
Example
- Condition: Brand == Nike
- Feature: Brand
- Operator: ==
- Value: Nike
Other examples
- Number of Buys > 3.
- Average Order Value > $50.
Boolean Operators
Boolean operators combine multiple conditions or rules and return a true/false result based on the logical relationships between them.
- AND: Returns true if all conditions are met, false otherwise.
- OR: Returns true if at least one condition is met, false otherwise.
Time Operators
Time operators define time intervals between two events in a sequence.
- None: No time interval between events.
- After & Within: Counts visitors who meet the condition within a specified duration after an event.
- After: Counts visitors who meet the condition after a specified duration.
- Within: Counts visitors who meet the condition within a specified duration.
Logical Operators
Logical operators are used to compare values of features when defining conditions.
- ==: Exactly equal to the given value.
- >: Greater than the given value.
- >=: Greater than or equal to the given value.
- <: Less than the given value.
- <=: Less than or equal to the given value.
- !=: Not equal to the given value.
- IN: Checks if the visitor exists within a given set of values.
- ~: Approximately equal to the given value.
Rules and Groups
Rules
A rule consists of one or more conditions. Each rule is separated by either an AND or OR operator and is used to define an audience.
Groups
A group contains one or more rules. There are two types of groups: Condition Groups and Sequence Groups.
Condition Group
A condition group consists of one or more rules, with each rule defined at the same level. By default, one condition group is present, but users can add any number of additional condition groups.
Sequence Group
A sequence group consists of one or more sequentially executed rules, each representing an event with associated attributes. Time intervals between rules can be defined, and users can add any number of sequence groups.
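Putting these pieces together, an audience combining a condition group and a sequence group might be represented as follows; the structure and field names are illustrative assumptions, not the platform's actual schema:

```python
# Illustrative representation of an audience definition; the structure and
# field names are assumptions for the sketch, not the platform's schema.
audience = {
    "name": "High-intent Nike shoppers",
    "condition_groups": [
        {
            "operator": "AND",
            "rules": [
                {"feature": "Brand", "operator": "==", "value": "Nike"},
                {"feature": "#Buys", "operator": ">", "value": 3},
                {"feature": "Average Order Value", "operator": ">", "value": 50},
            ],
        }
    ],
    "sequence_groups": [
        {
            "rules": [
                {"event": "add_to_cart", "attributes": {"category": "Shoes"}},
                {"event": "purchase", "time_operator": "Within", "duration_days": 2},
            ]
        }
    ],
}
```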