Basic Concepts

Understanding the core concepts of AutoWDS will help you better use the plugin for data collection.

Rule

What is a Rule?

A rule is a configuration file that defines how to extract data from web pages. It describes the complete collection process: opening pages, locating elements, extracting data, and saving the results.

Rule Components

A complete rule contains the following parts:

1. Start Configuration

Target website URL
Browser window size
HTTP request headers
Initialization steps

2. Collection Flow

Page navigation logic
Data extraction nodes
Pagination handling
Deep collection configuration

3. Field Definitions

Field names and types
Element selectors
Data extraction methods
Data processing rules

4. Save Settings

Data deduplication rules
Export format
Save location

Rule Types

Intelligent Rules

Automatically generated by AI
Suitable for standardized pages
Quick creation, no configuration needed
Can be manually adjusted and optimized

Visual Rules

Created through graphical interface
Supports complex collection scenarios
Full custom control
Reusable and shareable

Node

What is a Node?

A node is the basic unit of a rule flow. Each node represents a specific operation or data extraction step, and nodes are connected by edges to form a complete collection process.

Node Types

1. Start Node

The entry point of the flow, defining the initial state of collection.

Configuration:

URL: Starting page address
Viewport: Browser window size (width×height)
HTTP Headers: Custom request headers
Initial Steps: Operations after page load (like login, click, etc.)

Example Configuration:

{
  "url": "https://example.com/products",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "httpHeaders": [
    {
      "header": "User-Agent",
      "value": "Mozilla/5.0..."
    }
  ]
}
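
The httpHeaders list above stores each header as a header/value pair. As a rough illustration (not the plugin's own code), the pairs can be folded into the plain object form that most HTTP clients expect. The helper name toHeaderObject is hypothetical.

```javascript
// Hypothetical helper: converts the httpHeaders list of a start node
// into a plain { name: value } object. Not part of AutoWDS itself.
function toHeaderObject(httpHeaders = []) {
  const headers = {};
  for (const { header, value } of httpHeaders) {
    if (!header) continue; // skip entries without a header name
    headers[header] = value ?? "";
  }
  return headers;
}

const startNode = {
  url: "https://example.com/products",
  viewport: { width: 1920, height: 1080 },
  httpHeaders: [{ header: "User-Agent", value: "Mozilla/5.0..." }],
};

const headers = toHeaderObject(startNode.httpHeaders);
console.log(headers);
```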

2. Page Node

Opens a new page or tab.

Trigger Methods:

Click Element: Click link or button to open new page
Open URL: Directly visit specified URL

Configuration:

Type: click_element or open_url
Value: Element selector or URL address
Operation Steps: Operations after page opens

Use Cases:

Enter detail page from list page
Open search result links
Navigate to different category pages

3. List Node

Batch extract multiple data records from lists.

Configuration:

List Container Selector: Container holding all list items
List Item Selector: Selector for single list item
Field Configuration: Field definitions to extract
Pagination Configuration: How to paginate for more data

Example:

{
  "listSelector": ".product-list .product-item",
  "fields": [
    {
      "name": "title",
      "selector": ".title",
      "attr": "innerText"
    },
    {
      "name": "price",
      "selector": ".price",
      "attr": "innerText"
    }
  ],
  "pagination": {
    "type": "click_next",
    "config": {
      "selector": ".next-page"
    }
  }
}
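
To make the configuration above concrete, here is a minimal sketch of how such a list node could be evaluated with the standard DOM API. It assumes that listSelector matches one element per list item and that each field selector is resolved relative to that item. The function name extractList is illustrative, and the plugin's real implementation may differ.

```javascript
// Sketch only: evaluates a list-node style config against a DOM root.
// Assumes one matched element per list item and item-relative field selectors.
function extractList(root, config) {
  const items = Array.from(root.querySelectorAll(config.listSelector));
  return items.map((item) => {
    const record = {};
    for (const field of config.fields) {
      const el = item.querySelector(field.selector);
      // A missing element yields null instead of throwing.
      record[field.name] = el ? el[field.attr] : null;
    }
    return record;
  });
}
```

In a browser console, calling extractList(document, config) with the object above as config would return one record per matched product item, each with a title and a price.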

4. Detail Node

Extract detailed information from a single page.

Configuration:

Field Configuration: Field definitions to extract
Data Processing: Data cleaning and transformation rules

Use Cases:

Extract product detail page information
Collect complete article content
Get detailed user profiles

Node Connections

Nodes are connected by edges to form a data flow:

Start Node → List Node → Page Node → Detail Node

Execution Order:

Start from start node
Execute in connection order
List node executes subsequent nodes for each data record
Save data after all paths complete
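
The execution order above can be sketched as a small recursive walk over nodes and edges. Everything here is a stand-in: the node ids, the handler functions, and the sample records are hypothetical and only show how a list node fans out into one downstream run per record.

```javascript
// Hypothetical flow: Start -> List -> Page -> Detail.
const edges = { start: ["list"], list: ["page"], page: ["detail"], detail: [] };

// Each handler receives the data accumulated so far and returns one or
// more outputs. Only the list handler returns several (one per record).
const handlers = {
  start: (ctx) => [ctx],
  list: (ctx) =>
    [{ title: "A", url: "/a" }, { title: "B", url: "/b" }].map((row) => ({ ...ctx, ...row })),
  page: (ctx) => [{ ...ctx, opened: ctx.url }],
  detail: (ctx) => [{ ...ctx, description: `details of ${ctx.title}` }],
};

function run(nodeId, ctx, results) {
  for (const output of handlers[nodeId](ctx)) {
    const next = edges[nodeId];
    if (next.length === 0) {
      results.push(output); // end of a path: keep the merged record
    } else {
      for (const id of next) run(id, output, results);
    }
  }
  return results;
}

const records = run("start", {}, []);
console.log(records.length);
```

Because the list handler returns two records, the page and detail handlers each run twice, and the merged result of every path is saved once at the end.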

Selector

What is a Selector?

A selector is an expression used to locate web page elements. Like an address, it tells the plugin which element's data to extract.

CSS Selectors

The most commonly used selector type, with concise and intuitive syntax.

Basic Selectors:

/* Tag selector */
div
h1
span

/* Class selector */
.product
.title
.price

/* ID selector */
#header
#main-content

/* Attribute selector */
[data-id]
[href^="https"]
[class*="product"]

Combination Selectors:

/* Descendant selector */
.product .title

/* Child selector */
.product > .title

/* Adjacent sibling selector */
h2 + p

/* General sibling selector */
h2 ~ p

Pseudo-class Selectors:

/* First child element */
li:first-child

/* Last child element */
li:last-child

/* Nth child element */
li:nth-child(2)

/* Even elements */
li:nth-child(even)

XPath Selectors

More powerful selectors supporting complex queries.

Basic Syntax:

/* Select all div elements */
//div

/* Select div with specific class */
//div[@class='product']

/* Select element containing specific text */
//div[contains(text(), 'Product')]

/* Select element with specific attribute */
//a[@href='/detail']

/* Select parent element */
//div[@class='title']/parent::div

/* Select sibling element */
//div[@class='title']/following-sibling::div

Advanced Usage:

/* Select first element */
(//div[@class='product'])[1]

/* Select last element */
(//div[@class='product'])[last()]

/* Conditional selection */
//div[@class='product' and @data-id='123']

/* Text matching */
//div[text()='Exact match']
//div[contains(text(), 'Contains match')]

Selector Priority

Selector stability from high to low:

ID Selector - Most stable (if ID doesn't change)
data-* Attributes - Usually very stable
Semantic Classes - Relatively stable
Structural Selectors - Moderately stable
Dynamic Classes - Unstable

Recommended Practices:

/* Good selectors */
#product-123
[data-product-id="123"]
.product-title

/* Avoid using */
.css-1a2b3c4d  /* Dynamically generated class */
div > div > div > span  /* Too deep hierarchy */
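
As a rough aid for reviewing selectors against this ranking, the heuristic below labels a CSS selector string. It is a sketch under loud assumptions: the list of generated-class prefixes (css-, sc-) and the depth limit of three levels are guesses chosen for illustration, not rules used by the plugin.

```javascript
// Heuristic sketch: labels a CSS selector by how stable it is likely to be.
function classifySelector(selector) {
  // Count the levels separated by descendant (space) or child (>) combinators.
  const depth = selector.split(/\s*>\s*|\s+/).filter(Boolean).length;
  // Assumed pattern for generated class names such as .css-1a2b3c4d
  if (/\.(css|sc)-[a-z0-9]{5,}/i.test(selector)) return "dynamic-class";
  if (depth > 3) return "too-deep";
  if (/^#[\w-]+$/.test(selector)) return "id";
  if (/\[data-[\w-]+/.test(selector)) return "data-attribute";
  if (/\.[a-z][\w-]*/i.test(selector)) return "semantic-class";
  return "structural";
}

for (const s of ["#product-123", ".css-1a2b3c4d", "div > div > div > span"]) {
  console.log(s, "->", classifySelector(s));
}
```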

Field

What is a Field?

A field defines a specific data item to extract, including field name, data source, and processing method.

Field Configuration

Basic Properties:

{
  "id": "field_001",
  "name": "title",
  "selector": ".product-title",
  "attr": "innerText"
}

Property Description:

id: Field unique identifier
name: Field name (column name when exporting)
selector: Element selector
attr: Attribute to extract

Extraction Attributes

Common Attributes:

Attribute	Description	Example
innerText	Element text content	Product title
innerHTML	Element HTML content	`<span>Title</span>`
href	Link address	https://example.com
src	Image/resource address	https://example.com/img.jpg
value	Form input value	User input text
title	Title attribute	Mouse hover tooltip
data-*	Custom data attribute	data-id="123"

Extraction Examples:

// Extract text
selector: ".title"
attr: "innerText"
// Result: "Product Title"

// Extract link
selector: "a.detail-link"
attr: "href"
// Result: "https://example.com/product/123"

// Extract image
selector: "img.product-img"
attr: "src"
// Result: "https://example.com/images/product.jpg"

// Extract custom attribute
selector: ".product"
attr: "data-id"
// Result: "123"
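
The attr values above mix two kinds of names: DOM properties (innerText, innerHTML, href, src, value) and HTML attributes (data-id). A minimal sketch of a reader that handles both is shown below. The name readAttr and the property-first fallback order are assumptions, not the plugin's documented behavior.

```javascript
// Sketch: read a DOM property when the element has one, otherwise fall
// back to the HTML attribute of the same name.
function readAttr(el, attr) {
  if (!el) return null; // element not found
  if (attr in el) return el[attr]; // e.g. innerText, href, src, value
  return el.getAttribute(attr); // e.g. data-id
}
```

The distinction matters for links: in a browser, the href property of an anchor returns the resolved absolute URL, while getAttribute("href") returns the text exactly as written in the markup, which may be a relative path.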

Extractor

Further process extracted raw data.

Regular Expression Extractor

Use regular expressions to extract specific patterns from text.

Configuration:

{
  "type": "regex",
  "code": "\\d+\\.\\d+"
}

Example:

Raw data: "Price: $199.99"
Regex: \d+\.\d+
Result: "199.99"
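
The same extraction can be reproduced in plain JavaScript, which is handy for checking a pattern before saving it in a rule. This sketch assumes the extractor keeps the first match and yields nothing when the pattern does not match. The helper name applyRegex is hypothetical.

```javascript
// Assumption: the extractor returns the first match, or null if none.
function applyRegex(value, pattern) {
  const match = String(value).match(new RegExp(pattern));
  return match ? match[0] : null;
}

// In JSON each backslash is doubled; after parsing, the pattern is \d+\.\d+
const config = JSON.parse(String.raw`{"type": "regex", "code": "\\d+\\.\\d+"}`);

const price = applyRegex("Price: $199.99", config.code);
console.log(price);
```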

Sed Replacement Extractor

Use sed commands for text replacement.

Configuration:

{
  "type": "sed",
  "code": "s/\\$//g;s/Price: *//g"
}

Example:

Raw data: "Price: $199.99"
Sed command: s/\$//g;s/Price: *//g
Result: "199.99"
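
To experiment with substitution chains outside the plugin, the sketch below emulates a small subset of sed in JavaScript. It is an approximation: it supports only s/pattern/replacement/flags commands separated by semicolons, uses JavaScript regex syntax rather than sed's, and does not translate sed backreferences. The helper name applySed is hypothetical.

```javascript
// Minimal sed-like substitution. Not a full sed implementation.
function applySed(value, script) {
  const command = /^s\/((?:\\.|[^\/\\])*)\/((?:\\.|[^\/\\])*)\/([gi]*)$/;
  let result = String(value);
  for (const part of script.split(";")) {
    const parsed = command.exec(part.trim());
    if (!parsed) throw new Error(`Unsupported sed command: ${part}`);
    const [, pattern, replacement, flags] = parsed;
    result = result.replace(new RegExp(pattern, flags), replacement);
  }
  return result;
}

const raw = "Price: $199.99";
// Removing only the label leaves the space that followed the colon.
const labelOnly = applySed(raw, String.raw`s/\$//g;s/Price://g`);
// Matching the optional spaces as well gives a clean number.
const withSpaces = applySed(raw, String.raw`s/\$//g;s/Price: *//g`);
console.log(JSON.stringify(labelOnly), JSON.stringify(withSpaces));
```

Whether the plugin trims surrounding whitespace from results is not stated here, so consuming the space in the pattern itself is the safer choice.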

JavaScript Extractor

Use JavaScript code for complex processing.

Configuration:

{
  "type": "js",
  "code": "return value.replace(/[^0-9.]/g, '')"
}

Example:

// Clean price data
return value.replace(/[^0-9.]/g, '')

// Convert date format
return new Date(value).toISOString()

// Calculate discount
const original = parseFloat(data.originalPrice)
const current = parseFloat(value)
return ((1 - current/original) * 100).toFixed(2) + '%'
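
The snippets above use return without a surrounding function, which suggests the code string is run as a function body. A minimal sketch of that idea follows. It assumes the body receives value (the extracted text) and data (the record collected so far), as the examples imply; the wrapper name runJsExtractor is hypothetical.

```javascript
// Assumption: the "code" string is a function body with `value` and `data`
// in scope. This mirrors the examples, not a documented plugin API.
function runJsExtractor(code, value, data = {}) {
  const fn = new Function("value", "data", code);
  return fn(value, data);
}

const cleaned = runJsExtractor("return value.replace(/[^0-9.]/g, '')", "Price: $199.99");

const discount = runJsExtractor(
  `const original = parseFloat(data.originalPrice)
   const current = parseFloat(value)
   return ((1 - current/original) * 100).toFixed(2) + '%'`,
  "150",
  { originalPrice: "200" }
);

console.log(cleaned, discount);
```

Note that parseFloat returns NaN for text that starts with a currency symbol, so a price usually needs to be cleaned before it is used in a calculation such as the discount.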

Pagination

What is Pagination?

Pagination is the functionality to automatically browse multiple pages of content to collect more data.

Pagination Types

1. Click Next (click_next)

Click "Next Page" button or page number links.

Configuration:

{
  "type": "click_next",
  "config": {
    "selector": ".next-page"
  }
}

Use Cases:

Traditional pagination navigation
Page number links
"Next Page" button

2. Infinite Scroll (scroll)

Scroll to page bottom to trigger load more.

Configuration:

{
  "type": "scroll",
  "config": {
    "selector": "body"
  }
}

Use Cases:

Social media feeds
Waterfall layout
Auto-load more lists

3. Load More (load_more)

Click "Load More" button.

Configuration:

{
  "type": "load_more",
  "config": {
    "selector": ".load-more-btn"
  }
}

Use Cases:

Lists requiring click to load
"View More" button
"Expand All" functionality

Steps

What are Steps?

Steps are automated operation sequences executed on pages, such as clicking, inputting, scrolling, etc.

Step Types

Click

Click page element.

Configuration:

{
  "type": "click",
  "selectors": [["button.submit"]]
}

Double Click

Double-click page element.

Configuration:

{
  "type": "doubleClick",
  "selectors": [[".item"]]
}

Change (Input)

Input text in input box.

Configuration:

{
  "type": "change",
  "selectors": [["input[name='keyword']"]],
  "value": "Search keyword"
}

Key Down/Up

Simulate keyboard press.

Configuration:

{
  "type": "keyDown",
  "key": "Enter"
}

Navigate

Jump to specified URL.

Configuration:

{
  "type": "navigate",
  "url": "https://example.com/page2"
}

Scroll

Scroll page.

Configuration:

{
  "type": "scroll",
  "x": 0,
  "y": 1000
}

Step Combinations

Multiple steps can be combined into complex operation flows:

Example: Search and Collect

{
  "steps": [
    {
      "type": "change",
      "selectors": [["input.search"]],
      "value": "keyword"
    },
    {
      "type": "click",
      "selectors": [["button.search-btn"]]
    },
    {
      "type": "wait",
      "timeout": 2000
    }
  ]
}
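
Before running a step sequence like the one above, it can help to check that every step carries the keys its type needs. The sketch below derives the required keys from the example configurations in this section. The keyUp entry and the wait entry are assumptions (keyUp has no sample configuration here, and wait appears only in the combined example), and the real schema may accept other keys.

```javascript
// Required keys per step type, inferred from the examples in this section.
const REQUIRED_KEYS = {
  click: ["selectors"],
  doubleClick: ["selectors"],
  change: ["selectors", "value"],
  keyDown: ["key"],
  keyUp: ["key"], // assumed to mirror keyDown
  navigate: ["url"],
  scroll: ["x", "y"],
  wait: ["timeout"], // seen only in the combined example
};

// Returns a list of human-readable problems; an empty list means no issues.
function validateSteps(steps) {
  const problems = [];
  steps.forEach((step, index) => {
    const required = REQUIRED_KEYS[step.type];
    if (!required) {
      problems.push(`step ${index}: unknown type "${step.type}"`);
      return;
    }
    for (const key of required) {
      if (!(key in step)) problems.push(`step ${index}: missing "${key}"`);
    }
  });
  return problems;
}
```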

Working Principle

Collection Flow

1. Load starting page
   ↓
2. Execute initial steps (like login)
   ↓
3. Locate list container
   ↓
4. Extract list data
   ↓
5. For each data record:
   - Open detail page (if needed)
   - Extract detailed information
   - Merge data
   ↓
6. Paginate (if more pages)
   ↓
7. Repeat steps 3-6
   ↓
8. Save all data

Data Flow

Web Element → Selector Location → Attribute Extraction → Data Processing → Field Value → Data Record → Export File

Execution Context

Each node has a context when executing, containing:

Current Page: Browser page being operated
Current Node: Node configuration being executed
Data Path: Data accumulated from start to current
Log Recording: Execution process log information

Best Practices

1. Selector Writing

✅ Recommended:

/* Use stable attributes */
[data-product-id]
.product-title
#main-content

/* Appropriate hierarchy */
.product-list > .item

❌ Avoid:

/* Dynamically generated class */
.css-1a2b3c

/* Too deep hierarchy */
body > div > div > div > div > span

/* Position dependent */
div:nth-child(5)

2. Data Extraction

✅ Recommended:

Extract raw data, process later
Use appropriate extractors to clean data
Set default values for missing data

❌ Avoid:

Extracting content full of HTML tags
Leaving special characters unprocessed
Skipping data validation

3. Pagination Handling

✅ Recommended:

Set reasonable pagination delays
Add pagination count limits
Check whether more data is available

❌ Avoid:

Unlimited pagination
Paginating so fast that the site blocks you
Leaving pagination failures unhandled
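
The recommendations above (a delay between pages, a page limit, a check for more data, and handling of failures) can be combined in one loop. This is a generic sketch, not the plugin's pagination engine: extractPage and goToNext are hypothetical callbacks supplied by the caller, and the default limits are arbitrary.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// extractPage(pageNumber) returns the records of the current page.
// goToNext() resolves to false when there is no further page.
async function collectPages({ extractPage, goToNext, maxPages = 50, delayMs = 1000 }) {
  const records = [];
  for (let page = 1; page <= maxPages; page++) {
    records.push(...(await extractPage(page)));
    if (page === maxPages) break; // page limit reached
    let moved;
    try {
      moved = await goToNext();
    } catch (err) {
      // A failed page turn ends the run but keeps what was collected.
      console.warn(`Pagination failed after page ${page}: ${err.message}`);
      break;
    }
    if (!moved) break; // no more data
    await sleep(delayMs); // pause so requests are not sent too quickly
  }
  return records;
}
```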

4. Error Handling

✅ Recommended:

Add wait time to ensure page loads
Set timeout durations
Record detailed error logs

❌ Avoid:

Assuming elements always exist
Leaving network errors unhandled
Ignoring exceptional cases
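
Waiting with a timeout, instead of assuming an element exists, can be sketched as a small polling helper. The name waitFor and its default timings are illustrative. In a browser, the check callback would typically be something like () => document.querySelector(".product-list").

```javascript
// Polls `check` until it returns a truthy value or the timeout elapses.
// Resolves to the truthy value, or to null on timeout.
async function waitFor(check, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = check();
    if (result) return result;
    if (Date.now() >= deadline) return null;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Returning null on timeout lets the caller decide whether to log the problem, skip the record, or stop the run, rather than failing on a missing element.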

Next Steps

Now that you understand the basic concepts, you can continue learning:

Management Interface - Learn about plugin functional modules
Intelligent Collection - Use AI to quickly create collection rules
Visual Rule Development - Create complex collection rules
Tutorials - Learn data collection through examples