Basic Concepts

Understanding the core concepts of AutoWDS helps you use the plugin more effectively for data collection.

Rule

What is a Rule?

A rule is a configuration file that defines how to extract data from web pages. It describes the complete collection process: opening web pages, locating elements, extracting data, and saving results.

Rule Components

A complete rule contains the following parts:

1. Start Configuration

  • Target website URL
  • Browser window size
  • HTTP request headers
  • Initialization steps

2. Collection Flow

  • Page navigation logic
  • Data extraction nodes
  • Pagination handling
  • Deep collection configuration

3. Field Definitions

  • Field names and types
  • Element selectors
  • Data extraction methods
  • Data processing rules

4. Save Settings

  • Data deduplication rules
  • Export format
  • Save location

Rule Types

Intelligent Rules

  • Automatically generated by AI
  • Suitable for standardized pages
  • Quick creation, no configuration needed
  • Can be manually adjusted and optimized

Visual Rules

  • Created through graphical interface
  • Supports complex collection scenarios
  • Full custom control
  • Reusable and shareable

Node

What is a Node?

A node is the basic unit of a rule flow. Each node represents a specific operation or data extraction step, and multiple nodes are connected by edges to form a complete collection process.

Node Types

1. Start Node

The entry point of the flow, defining the initial state of collection.

Configuration:

  • URL: Starting page address
  • Viewport: Browser window size (width×height)
  • HTTP Headers: Custom request headers
  • Initial Steps: Operations to run after the page loads (such as logging in or clicking)

Example Configuration:

{
  "url": "https://example.com/products",
  "viewport": {
    "width": 1920,
    "height": 1080
  },
  "httpHeaders": [
    {
      "header": "User-Agent",
      "value": "Mozilla/5.0..."
    }
  ]
}
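
The httpHeaders array above stores each header as a header/value pair, while many HTTP APIs expect a plain object. The following is a minimal sketch of that conversion; the helper name headersToObject is hypothetical and not part of AutoWDS.

```javascript
// Hypothetical helper (not an AutoWDS API): convert the start node's
// httpHeaders array into a plain { name: value } object.
function headersToObject(startConfig) {
  const result = {};
  for (const { header, value } of startConfig.httpHeaders ?? []) {
    // HTTP header names are case-insensitive, so normalize to lower case.
    result[header.toLowerCase()] = value;
  }
  return result;
}

const startConfig = {
  url: "https://example.com/products",
  viewport: { width: 1920, height: 1080 },
  httpHeaders: [{ header: "User-Agent", value: "Mozilla/5.0..." }],
};

console.log(headersToObject(startConfig));
```

A start configuration without httpHeaders yields an empty object rather than an error.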

2. Page Node

Opens a new page or tab.

Trigger Methods:

  • Click Element: Click link or button to open new page
  • Open URL: Directly visit specified URL

Configuration:

  • Type: click_element or open_url
  • Value: Element selector or URL address
  • Operation Steps: Operations after page opens

Use Cases:

  • Enter detail page from list page
  • Open search result links
  • Navigate to different category pages

3. List Node

Extracts multiple data records from a list in a single pass.

Configuration:

  • List Container Selector: Container holding all list items
  • List Item Selector: Selector for single list item
  • Field Configuration: Field definitions to extract
  • Pagination Configuration: How to paginate for more data

Example:

{
  "listSelector": ".product-list .product-item",
  "fields": [
    {
      "name": "title",
      "selector": ".title",
      "attr": "innerText"
    },
    {
      "name": "price",
      "selector": ".price",
      "attr": "innerText"
    }
  ],
  "pagination": {
    "type": "click_next",
    "config": {
      "selector": ".next-page"
    }
  }
}
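
To make the list configuration concrete, here is a rough sketch of how such a configuration could be applied to a page. It is an illustration, not the plugin's actual implementation: the function name extractList is an assumption, and root can be any object that exposes querySelectorAll (for example, document in a browser). The sketch only reads DOM properties such as innerText.

```javascript
// Sketch only: apply a list node configuration to a DOM-like root.
function extractList(root, listConfig) {
  const items = Array.from(root.querySelectorAll(listConfig.listSelector));
  return items.map((item) => {
    const record = {};
    for (const field of listConfig.fields) {
      const el = item.querySelector(field.selector);
      // A missing element yields null instead of throwing.
      record[field.name] = el ? el[field.attr] : null;
    }
    return record;
  });
}

// In a browser console you could try:
//   extractList(document, {
//     listSelector: ".product-list .product-item",
//     fields: [{ name: "title", selector: ".title", attr: "innerText" }],
//   });
```

Each list item becomes one record, and each field is looked up relative to its own item, so the field selectors stay short.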

4. Detail Node

Extract detailed information from a single page.

Configuration:

  • Field Configuration: Field definitions to extract
  • Data Processing: Data cleaning and transformation rules

Use Cases:

  • Extract product detail page information
  • Collect complete article content
  • Get detailed user profiles

Node Connections

Nodes are connected by edges to form a data flow:

Start Node → List Node → Page Node → Detail Node

Execution Order:

  1. Start from start node
  2. Execute in connection order
  3. List node executes subsequent nodes for each data record
  4. Save data after all paths complete
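
The execution order above can be pictured as a walk over the node graph in which a list node fans out into one path per record. The sketch below assumes hypothetical node and edge shapes (a run callback per node, from/to pairs per edge) chosen only to illustrate the ordering; the real rule format may differ.

```javascript
// Sketch under assumptions: node/edge shapes and `run` callbacks are hypothetical.
function executeFlow(nodes, edges, startId) {
  const results = [];
  function visit(nodeId, path) {
    // A list node may return several records; other nodes return one.
    const outputs = nodes[nodeId].run(path);
    const next = edges.filter((e) => e.from === nodeId).map((e) => e.to);
    for (const output of outputs) {
      const merged = { ...path, ...output };
      if (next.length === 0) {
        results.push(merged); // End of a path: the record is complete.
      } else {
        for (const id of next) visit(id, merged);
      }
    }
  }
  visit(startId, {});
  return results; // Returned only after all paths have completed.
}

const nodes = {
  start: { run: () => [{ site: "example.com" }] },
  list: { run: () => [{ title: "A" }, { title: "B" }] },
  page: { run: () => [{}] },
  detail: { run: (path) => [{ description: `Details of ${path.title}` }] },
};
const edges = [
  { from: "start", to: "list" },
  { from: "list", to: "page" },
  { from: "page", to: "detail" },
];

console.log(executeFlow(nodes, edges, "start"));
```

With two list records, the page and detail nodes each run twice, and the result contains two merged records.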

Selector

What is a Selector?

A selector is an expression used to locate web page elements. Like an address, it tells the plugin which element's data to extract.

CSS Selectors

The most commonly used selector type, with concise and intuitive syntax.

Basic Selectors:

/* Tag selector */
div
h1
span

/* Class selector */
.product
.title
.price

/* ID selector */
#header
#main-content

/* Attribute selector */
[data-id]
[href^="https"]
[class*="product"]

Combination Selectors:

/* Descendant selector */
.product .title

/* Child selector */
.product > .title

/* Adjacent sibling selector */
h2 + p

/* General sibling selector */
h2 ~ p

Pseudo-class Selectors:

/* First child element */
li:first-child

/* Last child element */
li:last-child

/* Nth child element */
li:nth-child(2)

/* Even elements */
li:nth-child(even)

XPath Selectors

More powerful selectors supporting complex queries.

Basic Syntax:

/* Select all div elements */
//div

/* Select div with specific class */
//div[@class='product']

/* Select element containing specific text */
//div[contains(text(), 'Product')]

/* Select element with specific attribute */
//a[@href='/detail']

/* Select parent element */
//div[@class='title']/parent::div

/* Select sibling element */
//div[@class='title']/following-sibling::div

Advanced Usage:

/* Select first element */
(//div[@class='product'])[1]

/* Select last element */
(//div[@class='product'])[last()]

/* Conditional selection */
//div[@class='product' and @data-id='123']

/* Text matching */
//div[text()='Exact match']
//div[contains(text(), 'Contains match')]

Selector Priority

Selector stability from high to low:

  1. ID Selector - Most stable (if ID doesn't change)
  2. data-* Attributes - Usually very stable
  3. Semantic Classes - Relatively stable
  4. Structural Selectors - Moderately stable
  5. Dynamic Classes - Unstable

Recommended Practices:

/* Good selectors */
#product-123
[data-product-id="123"]
.product-title

/* Avoid using */
.css-1a2b3c4d  /* Dynamically generated class */
div > div > div > span  /* Too deep hierarchy */
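
If you maintain many rules, a small lint-style check can catch fragile selectors early. The following heuristic is only a sketch: the function name, the class-name pattern (prefixes commonly produced by CSS-in-JS tools), and the depth threshold are assumptions, not AutoWDS behavior.

```javascript
// Heuristic sketch, not an AutoWDS feature: flag CSS selectors that are
// likely to break when the page is redeployed.
function selectorWarnings(selector) {
  const warnings = [];
  // Classes such as .css-1a2b3c4d (prefix plus a hash containing digits)
  // are typically generated at build time.
  if (/\.(?:css|sc)-(?=[a-z0-9]*\d)[a-z0-9]{5,}/i.test(selector)) {
    warnings.push("dynamic class");
  }
  // Three or more child combinators suggest a layout-dependent path.
  const depth = (selector.match(/>/g) || []).length;
  if (depth >= 3) warnings.push("deep hierarchy");
  return warnings;
}

console.log(selectorWarnings(".css-1a2b3c4d"));          // [ 'dynamic class' ]
console.log(selectorWarnings("div > div > div > span")); // [ 'deep hierarchy' ]
console.log(selectorWarnings("#product-123"));           // []
```

An empty result does not prove a selector is stable; it only means none of these simple patterns matched.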

Field

What is a Field?

A field defines a specific data item to extract, including field name, data source, and processing method.

Field Configuration

Basic Properties:

{
  "id": "field_001",
  "name": "title",
  "selector": ".product-title",
  "attr": "innerText"
}

Property Description:

  • id: Field unique identifier
  • name: Field name (column name when exporting)
  • selector: Element selector
  • attr: Attribute to extract

Extraction Attributes

Common Attributes:

Attribute   Description              Example
innerText   Element text content     Product title
innerHTML   Element HTML content     <span>Title</span>
href        Link address             https://example.com
src         Image/resource address   https://example.com/img.jpg
value       Form input value         User input text
title       Title attribute          Mouse hover tooltip
data-*      Custom data attribute    data-id="123"

Extraction Examples:

// Extract text
selector: ".title"
attr: "innerText"
// Result: "Product Title"

// Extract link
selector: "a.detail-link"
attr: "href"
// Result: "https://example.com/product/123"

// Extract image
selector: "img.product-img"
attr: "src"
// Result: "https://example.com/images/product.jpg"

// Extract custom attribute
selector: ".product"
attr: "data-id"
// Result: "123"
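
In the browser, some of these values are DOM properties (innerText, innerHTML, value), and href and src read as properties return the resolved absolute URL, while getAttribute returns the raw attribute text. The sketch below shows one way to read a value by name under that assumption; readAttr and its property list are illustrative, not the plugin's code.

```javascript
// Sketch: `el` is a DOM element (or any object with the same members).
// Names in this set are read as properties; everything else as attributes.
const DOM_PROPERTIES = new Set(["innerText", "innerHTML", "value", "href", "src"]);

function readAttr(el, attr) {
  if (el == null) return null; // Element not found on the page.
  if (DOM_PROPERTIES.has(attr)) return el[attr];
  // title, data-* and other attributes are read as written in the HTML.
  return el.getAttribute(attr);
}

// In a browser console you could try:
//   readAttr(document.querySelector("a.detail-link"), "href");
//   readAttr(document.querySelector(".product"), "data-id");
```

Returning null for a missing element lets later processing substitute a default value instead of failing.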

Extractor

Extractors further process the raw data after it has been extracted.

Regular Expression Extractor

Use regular expressions to extract specific patterns from text.

Configuration:

{
  "type": "regex",
  "code": "\\d+\\.\\d+"
}

Example:

Raw data: "Price: $199.99"
Regex: \d+\.\d+
Result: "199.99"
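
The same extraction can be reproduced in plain JavaScript, which is handy for testing a pattern before putting it in a rule. This sketch assumes the extractor returns the first full match; whether AutoWDS returns the full match or a capture group is not stated here.

```javascript
// Sketch: apply a { type: "regex", code } extractor to a raw string.
function applyRegexExtractor(extractor, value) {
  const match = new RegExp(extractor.code).exec(value);
  // Return the first full match, or null when nothing matches.
  return match ? match[0] : null;
}

const priceExtractor = { type: "regex", code: "\\d+\\.\\d+" };
console.log(applyRegexExtractor(priceExtractor, "Price: $199.99")); // prints 199.99
```

Note the doubled backslashes: inside a JSON or JavaScript string, \\d stands for the regex token \d.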

Sed Replacement Extractor

Use sed commands for text replacement.

Configuration:

{
  "type": "sed",
  "code": "s/\\$//g;s/Price: *//g"
}

Example:

Raw data: "Price: $199.99"
Sed command: s/\$//g;s/Price: *//g
Result: "199.99"
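
To see how chained substitutions behave, here is a minimal JavaScript sketch of sed-style replacement. It is an approximation under stated assumptions: it supports only s/pattern/replacement/flags commands separated by semicolons, interprets patterns as JavaScript regular expressions, and treats the replacement as literal text. The plugin's own sed support may differ.

```javascript
// Minimal sketch of sed-style substitution (not the plugin's implementation).
function applySed(script, value) {
  const command = /^s\/((?:\\.|[^\/])*)\/((?:\\.|[^\/])*)\/([gi]*)$/;
  // Naive split: a semicolon inside a pattern is not supported.
  return script.split(";").reduce((text, raw) => {
    const parts = command.exec(raw.trim());
    if (!parts) throw new Error(`Unsupported sed command: ${raw}`);
    const [, pattern, replacement, flags] = parts;
    // A replacer function keeps the replacement literal.
    return text.replace(new RegExp(pattern, flags), () => replacement);
  }, value);
}

console.log(applySed("s/\\$//g;s/Price: *//g", "Price: $199.99")); // prints 199.99
```

Commands run left to right, each on the output of the previous one. A pattern of Price: alone would leave the space after the colon in the result, which is why the sketch uses Price: * to consume it.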

JavaScript Extractor

Use JavaScript code for complex processing.

Configuration:

{
  "type": "js",
  "code": "return value.replace(/[^0-9.]/g, '')"
}

Example:

// Clean price data
return value.replace(/[^0-9.]/g, '')

// Convert date format
return new Date(value).toISOString()

// Calculate discount
const original = parseFloat(data.originalPrice)
const current = parseFloat(value)
return ((1 - current/original) * 100).toFixed(2) + '%'
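
The snippets above are function bodies that receive value (the raw field value) and, in the discount example, data (other fields of the record). One way to try them locally is to compile the code string into a function, as sketched below. The runner name is hypothetical, and how the plugin actually evaluates the code is an assumption; in particular, new Function may be blocked by a page's or extension's Content Security Policy.

```javascript
// Sketch: compile an extractor's code string into a function of (value, data).
function runJsExtractor(extractor, value, data = {}) {
  const fn = new Function("value", "data", extractor.code);
  return fn(value, data);
}

const cleanPrice = { type: "js", code: "return value.replace(/[^0-9.]/g, '')" };
console.log(runJsExtractor(cleanPrice, "Price: $199.99")); // prints 199.99

const discount = {
  type: "js",
  code: [
    "const original = parseFloat(data.originalPrice)",
    "const current = parseFloat(value)",
    "return ((1 - current/original) * 100).toFixed(2) + '%'",
  ].join("\n"),
};
console.log(runJsExtractor(discount, "150", { originalPrice: "200" })); // prints 25.00%
```

Testing extractor code this way makes it easier to check edge cases, such as empty strings or missing fields, before running a full collection.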

Pagination

What is Pagination?

Pagination lets a rule move through multiple pages of content automatically so that it can collect more data.

Pagination Types

1. Click Next (click_next)

Clicks the "Next Page" button or a page number link.

Configuration:

{
  "type": "click_next",
  "config": {
    "selector": ".next-page"
  }
}

Use Cases:

  • Traditional pagination navigation
  • Page number links
  • "Next Page" button

2. Infinite Scroll (scroll)

Scrolls to the bottom of the page to trigger loading of more content.

Configuration:

{
  "type": "scroll",
  "config": {
    "selector": "body"
  }
}

Use Cases:

  • Social media feeds
  • Waterfall layout
  • Auto-load more lists

3. Load More (load_more)

Click "Load More" button.

Configuration:

{
  "type": "load_more",
  "config": {
    "selector": ".load-more-btn"
  }
}

Use Cases:

  • Lists requiring click to load
  • "View More" button
  • "Expand All" functionality
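
All three pagination types follow the same loop: extract, try to advance, and stop when nothing more can be loaded or a page limit is reached. The sketch below illustrates that loop. The page driver and its methods (extract, exists, click, scrollToBottom) are hypothetical stand-ins, and extract is assumed to return only records that were not returned before.

```javascript
// Sketch: `page` is a hypothetical driver object, not an AutoWDS API.
async function collectAllPages(page, pagination, { maxPages = 10 } = {}) {
  const records = [];
  const { type, config } = pagination;
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    records.push(...(await page.extract()));
    if (type === "scroll") {
      // Stop when scrolling no longer loads new content.
      if (!(await page.scrollToBottom(config.selector))) break;
    } else if (type === "click_next" || type === "load_more") {
      // Stop when the button is gone (last page reached).
      if (!(await page.exists(config.selector))) break;
      await page.click(config.selector);
    } else {
      throw new Error(`Unknown pagination type: ${type}`);
    }
  }
  return records;
}
```

The maxPages option guards against endless pagination on sites where the "next" control never disappears.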

Steps

What are Steps?

Steps are automated operations executed in sequence on a page, such as clicking, typing, and scrolling.

Step Types

Click

Click page element.

Configuration:

{
  "type": "click",
  "selectors": [["button.submit"]]
}

Double Click

Double-click page element.

Configuration:

{
  "type": "doubleClick",
  "selectors": [[".item"]]
}

Change (Input)

Types text into an input box.

Configuration:

{
  "type": "change",
  "selectors": [["input[name='keyword']"]],
  "value": "Search keyword"
}

Key Down/Up

Simulates pressing (keyDown) or releasing (keyUp) a keyboard key.

Configuration:

{
  "type": "keyDown",
  "key": "Enter"
}

Navigate

Jump to specified URL.

Configuration:

{
  "type": "navigate",
  "url": "https://example.com/page2"
}

Scroll

Scroll page.

Configuration:

{
  "type": "scroll",
  "x": 0,
  "y": 1000
}

Step Combinations

Multiple steps can be combined into complex operation flows:

Example: Search and Collect

{
  "steps": [
    {
      "type": "change",
      "selectors": [["input.search"]],
      "value": "keyword"
    },
    {
      "type": "click",
      "selectors": [["button.search-btn"]]
    },
    {
      "type": "wait",
      "timeout": 2000
    }
  ]
}
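
Before running a step sequence, it can help to check that every step carries the properties its type needs. The sketch below derives the required properties from the examples in this section; the keyUp entry is an assumption based on the "Key Down/Up" heading, and validateSteps itself is a hypothetical helper rather than a plugin function.

```javascript
// Required properties per step type, taken from the examples in this section.
const REQUIRED_PROPS = new Map([
  ["click", ["selectors"]],
  ["doubleClick", ["selectors"]],
  ["change", ["selectors", "value"]],
  ["keyDown", ["key"]],
  ["keyUp", ["key"]], // Assumed to mirror keyDown.
  ["navigate", ["url"]],
  ["scroll", ["x", "y"]],
  ["wait", ["timeout"]],
]);

function validateSteps(steps) {
  const errors = [];
  steps.forEach((step, i) => {
    const required = REQUIRED_PROPS.get(step.type);
    if (!required) {
      errors.push(`step ${i}: unknown type "${step.type}"`);
      return;
    }
    for (const key of required) {
      if (!(key in step)) errors.push(`step ${i}: missing "${key}"`);
    }
  });
  return errors;
}

console.log(validateSteps([
  { type: "change", selectors: [["input.search"]], value: "keyword" },
  { type: "click", selectors: [["button.search-btn"]] },
  { type: "wait", timeout: 2000 },
])); // prints []
```

An empty array means every step passed; otherwise each message names the step index and the problem.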

Working Principle

Collection Flow

1. Load starting page
   ↓
2. Execute initial steps (like login)
   ↓
3. Locate list container
   ↓
4. Extract list data
   ↓
5. For each data record:
   - Open detail page (if needed)
   - Extract detailed information
   - Merge data
   ↓
6. Paginate (if more pages)
   ↓
7. Repeat steps 3-6
   ↓
8. Save all data

Data Flow

Web Element → Selector Location → Attribute Extraction → Data Processing → Field Value → Data Record → Export File

Execution Context

Each node runs with an execution context that contains:

  • Current Page: Browser page being operated
  • Current Node: Node configuration being executed
  • Data Path: Data accumulated from start to current
  • Log Recording: Execution process log information

Best Practices

1. Selector Writing

✅ Recommended:

/* Use stable attributes */
[data-product-id]
.product-title
#main-content

/* Appropriate hierarchy */
.product-list > .item

❌ Avoid:

/* Dynamically generated class */
.css-1a2b3c

/* Too deep hierarchy */
body > div > div > div > div > span

/* Position dependent */
div:nth-child(5)

2. Data Extraction

✅ Recommended:

  • Extract raw data, process later
  • Use appropriate extractors to clean data
  • Set default values for missing data

❌ Avoid:

  • Extracting content with lots of HTML tags
  • Leaving special characters unprocessed
  • Skipping data validation

3. Pagination Handling

✅ Recommended:

  • Set reasonable pagination delays
  • Add pagination count limits
  • Check if more data available

❌ Avoid:

  • Paginating without a limit
  • Paginating so fast that the site blocks you
  • Leaving pagination failures unhandled

4. Error Handling

✅ Recommended:

  • Add wait time to ensure page loads
  • Set timeout durations
  • Record detailed error logs

❌ Avoid:

  • Assuming elements always exist
  • Leaving network errors unhandled
  • Ignoring exception cases
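
Waiting with a timeout is a simple way to avoid assuming that an element exists. The helper below is a generic sketch, not part of AutoWDS: it polls a check function until it returns a truthy value and throws if the timeout expires. The option names timeoutMs and intervalMs are illustrative.

```javascript
// Sketch: poll `check` until it returns a truthy value or the timeout expires.
async function waitFor(check, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result) return result;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs} ms`);
    }
    // Pause before the next attempt so the page is not polled in a tight loop.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// In a browser console you could try:
//   const list = await waitFor(() => document.querySelector(".product-list"));
```

Catching the timeout error gives you a place to log the failure and decide whether to retry or skip the record.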

Next Steps

Now that you understand the basic concepts, you can continue learning: