Basic Concepts
Understanding the core concepts of AutoWDS will help you use the plugin more effectively for data collection.
Rule
What is a Rule?
A rule is a configuration file that defines how to extract data from web pages. It describes the complete collection process: opening web pages, locating elements, extracting data, and saving results.
Rule Components
A complete rule contains the following parts:
1. Start Configuration
- Target website URL
- Browser window size
- HTTP request headers
- Initialization steps
2. Collection Flow
- Page navigation logic
- Data extraction nodes
- Pagination handling
- Deep collection configuration
3. Field Definitions
- Field names and types
- Element selectors
- Data extraction methods
- Data processing rules
4. Save Settings
- Data deduplication rules
- Export format
- Save location
Rule Types
Intelligent Rules
- Automatically generated by AI
- Suitable for standardized pages
- Quick creation, no configuration needed
- Can be manually adjusted and optimized
Visual Rules
- Created through graphical interface
- Supports complex collection scenarios
- Full custom control
- Reusable and shareable
Node
What is a Node?
A node is the basic unit in a rule flow. Each node represents a specific operation or data extraction step, and multiple nodes are connected by edges to form a complete collection process.
Node Types
1. Start Node
The entry point of the flow, defining the initial state of collection.
Configuration:
- URL: Starting page address
- Viewport: Browser window size (width×height)
- HTTP Headers: Custom request headers
- Initial Steps: Operations to run after the page loads (such as logging in or clicking)
Example Configuration:
{
"url": "https://example.com/products",
"viewport": {
"width": 1920,
"height": 1080
},
"httpHeaders": [
{
"header": "User-Agent",
"value": "Mozilla/5.0..."
}
]
}
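The `httpHeaders` list above pairs each header name with a value. As a minimal sketch, such a list could be flattened into a plain object before it is passed to an HTTP client. The helper name `headersToObject` is hypothetical and is not part of AutoWDS; only the field names `header` and `value` come from the example configuration.

```javascript
// Hypothetical helper, not an AutoWDS API: flattens the start node's
// httpHeaders list into a { name: value } object.
function headersToObject(httpHeaders = []) {
  const result = {};
  for (const { header, value } of httpHeaders) {
    // HTTP header names are case-insensitive, so normalize them to lower case.
    result[header.toLowerCase()] = value;
  }
  return result;
}

// Same shape as the example configuration above.
const startConfig = {
  url: "https://example.com/products",
  viewport: { width: 1920, height: 1080 },
  httpHeaders: [{ header: "User-Agent", value: "Mozilla/5.0..." }],
};

console.log(headersToObject(startConfig.httpHeaders));
```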
2. Page Node
Opens a new page or tab.
Trigger Methods:
- Click Element: Click link or button to open new page
- Open URL: Directly visit specified URL
Configuration:
- Type: click_element or open_url
- Value: Element selector or URL address
- Operation Steps: Operations after page opens
Use Cases:
- Enter detail page from list page
- Open search result links
- Navigate to different category pages
3. List Node
Extracts multiple data records from a list in one batch.
Configuration:
- List Container Selector: Container holding all list items
- List Item Selector: Selector for single list item
- Field Configuration: Field definitions to extract
- Pagination Configuration: How to paginate for more data
Example:
{
"listSelector": ".product-list .product-item",
"fields": [
{
"name": "title",
"selector": ".title",
"attr": "innerText"
},
{
"name": "price",
"selector": ".price",
"attr": "innerText"
}
],
"pagination": {
"type": "click_next",
"config": {
"selector": ".next-page"
}
}
}
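To make the list configuration more concrete, the following sketch shows one way a list node could turn such a configuration into records. It is an illustration under assumptions, not the AutoWDS engine: the function name `extractList` is hypothetical, `root` is assumed to be a DOM `document` or element, and pagination is left out.

```javascript
// Illustrative sketch only: how a list configuration could be applied to a DOM tree.
function extractList(root, config) {
  // Every element matched by listSelector is treated as one data record.
  const items = Array.from(root.querySelectorAll(config.listSelector));
  return items.map((item) => {
    const record = {};
    for (const field of config.fields) {
      // Field selectors are resolved relative to the list item, not the whole page.
      const element = item.querySelector(field.selector);
      // A missing element yields null instead of throwing.
      // Reading element[attr] covers DOM properties such as innerText, href and src;
      // data-* attributes would need getAttribute instead.
      record[field.name] = element ? element[field.attr] : null;
    }
    return record;
  });
}
```

In a browser console, `extractList(document, config)` with the example configuration above would return one object per `.product-item`, each with a `title` and a `price` key.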
4. Detail Node
Extracts detailed information from a single page.
Configuration:
- Field Configuration: Field definitions to extract
- Data Processing: Data cleaning and transformation rules
Use Cases:
- Extract product detail page information
- Collect complete article content
- Get detailed user profiles
Node Connections
Nodes are connected by edges to form data flow:
Start Node → List Node → Page Node → Detail Node
Execution Order:
- Start from start node
- Execute in connection order
- List node executes subsequent nodes for each data record
- Save data after all paths complete
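The execution order above can be sketched as a walk over the node graph. The data shapes used here (`{ id, type }` for nodes, `{ from, to }` for edges) and the function name `executionOrder` are assumptions made for illustration; the document does not specify how AutoWDS stores nodes and edges.

```javascript
// Sketch under assumed data shapes: nodes are { id, type }, edges are { from, to }.
function executionOrder(nodes, edges, startId) {
  const byId = new Map(nodes.map((node) => [node.id, node]));
  const visited = new Set();
  const order = [];

  function visit(id) {
    // Skip nodes already visited (cycles) and edges that point nowhere.
    if (visited.has(id) || !byId.has(id)) return;
    visited.add(id);
    order.push(byId.get(id).type);
    // Follow outgoing edges in the order they were declared.
    for (const edge of edges) {
      if (edge.from === id) visit(edge.to);
    }
  }

  visit(startId);
  return order;
}

const flowNodes = [
  { id: "n1", type: "start" },
  { id: "n2", type: "list" },
  { id: "n3", type: "page" },
  { id: "n4", type: "detail" },
];
const flowEdges = [
  { from: "n1", to: "n2" },
  { from: "n2", to: "n3" },
  { from: "n3", to: "n4" },
];

console.log(executionOrder(flowNodes, flowEdges, "n1").join(" -> "));
// prints: start -> list -> page -> detail
```

The sketch only shows the order in which node types are reached. A real run would also repeat the nodes downstream of a list node once per extracted record.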
Selector
What is a Selector?
A selector is an expression used to locate web page elements. Like an address, it tells the plugin which element's data to extract.
CSS Selectors
The most commonly used selector type, with concise and intuitive syntax.
Basic Selectors:
/* Tag selector */
div
h1
span
/* Class selector */
.product
.title
.price
/* ID selector */
#header
#main-content
/* Attribute selector */
[data-id]
[href^="https"]
[class*="product"]
Combination Selectors:
/* Descendant selector */
.product .title
/* Child selector */
.product > .title
/* Adjacent sibling selector */
h2 + p
/* General sibling selector */
h2 ~ p
Pseudo-class Selectors:
/* First child element */
li:first-child
/* Last child element */
li:last-child
/* Nth child element */
li:nth-child(2)
/* Even elements */
li:nth-child(even)
XPath Selectors
More powerful selectors supporting complex queries.
Basic Syntax:
/* Select all div elements */
//div
/* Select div whose class attribute is exactly 'product' */
//div[@class='product']
/* Select element containing specific text */
//div[contains(text(), 'Product')]
/* Select element with specific attribute */
//a[@href='/detail']
/* Select parent element */
//div[@class='title']/parent::div
/* Select following sibling elements */
//div[@class='title']/following-sibling::div
Advanced Usage:
/* Select first element */
(//div[@class='product'])[1]
/* Select last element */
(//div[@class='product'])[last()]
/* Conditional selection */
//div[@class='product' and @data-id='123']
/* Text matching */
//div[text()='Exact match']
//div[contains(text(), 'Contains match')]
Selector Priority
Selector stability from high to low:
- ID Selector - Most stable (if ID doesn't change)
- data-* Attributes - Usually very stable
- Semantic Classes - Relatively stable
- Structural Selectors - Moderately stable
- Dynamic Classes - Unstable
Recommended Practices:
/* Good selectors */
#product-123
[data-product-id="123"]
.product-title
/* Avoid using */
.css-1a2b3c4d /* Dynamically generated class */
div > div > div > span /* Too deep hierarchy */
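The stability ranking above can be approximated with a few pattern checks. This is a rough heuristic sketch, not something AutoWDS is documented to do. The function name `classifySelector`, the category labels, and the assumption that generated class names start with `css-` or `sc-` are all illustrative choices.

```javascript
// Heuristic sketch: guesses which stability category a CSS selector falls into.
function classifySelector(selector) {
  // Assumption: generated class names look like ".css-1a2b3c4d" or ".sc-abc123".
  if (/\.(?:css|sc)-[A-Za-z0-9]{4,}/.test(selector)) return "dynamic-class";
  // A lone ID selector such as "#product-123".
  if (/^#[\w-]+$/.test(selector)) return "id";
  // Any data-* attribute selector such as [data-product-id="123"].
  if (/\[data-[\w-]+/.test(selector)) return "data-attribute";
  // Deep child chains or position-dependent pseudo-classes depend on page structure.
  const childSteps = (selector.match(/>/g) || []).length;
  if (childSteps >= 3 || /:nth-child\(/.test(selector)) return "structural";
  // Remaining class-based selectors are treated as semantic classes.
  if (/\.[A-Za-z][\w-]*/.test(selector)) return "semantic-class";
  return "structural";
}

for (const selector of ["#product-123", '[data-product-id="123"]', ".product-title", ".css-1a2b3c4d"]) {
  console.log(selector, "=>", classifySelector(selector));
}
```

A check like this could flag fragile selectors when reviewing a rule, but it cannot tell whether an ID or class is actually stable on a given site.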
Field
What is a Field?
A field defines a specific data item to extract, including field name, data source, and processing method.
Field Configuration
Basic Properties:
{
"id": "field_001",
"name": "title",
"selector": ".product-title",
"attr": "innerText"
}
Property Description:
- id: Field unique identifier
- name: Field name (column name when exporting)
- selector: Element selector
- attr: Attribute to extract
Extraction Attributes
Common Attributes:
| Attribute | Description | Example |
|---|---|---|
| innerText | Element text content | Product title |
| innerHTML | Element HTML content | <span>Title</span> |
| href | Link address | https://example.com |
| src | Image/resource address | https://example.com/img.jpg |
| value | Form input value | User input text |
| title | Title attribute | Mouse hover tooltip |
| data-* | Custom data attribute | data-id="123" |
Extraction Examples:
// Extract text
selector: ".title"
attr: "innerText"
// Result: "Product Title"
// Extract link
selector: "a.detail-link"
attr: "href"
// Result: "https://example.com/product/123"
// Extract image
selector: "img.product-img"
attr: "src"
// Result: "https://example.com/images/product.jpg"
// Extract custom attribute
selector: ".product"
attr: "data-id"
// Result: "123"
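The attributes in the table fall into two groups in the DOM: properties such as `innerText` or `href`, and HTML attributes such as `data-id`. The sketch below shows one way to resolve a field's `attr` value for both groups. The helper name `readAttr` is hypothetical, and whether AutoWDS resolves attributes this way is an assumption.

```javascript
// Hypothetical helper: resolves a field's "attr" against a DOM element.
function readAttr(element, attr) {
  // data-* values exist as HTML attributes, not as same-named element properties.
  if (attr.startsWith("data-")) return element.getAttribute(attr);
  // innerText, innerHTML, href, src and value are DOM properties.
  // For links and images the property form returns the resolved absolute URL.
  const value = element[attr];
  // Fall back to the raw HTML attribute when no such property exists.
  return value === undefined ? element.getAttribute(attr) : value;
}
```

In a browser, `readAttr(document.querySelector("a.detail-link"), "href")` would return the absolute link address, which matches the `href` result shown in the examples above.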
Extractor
An extractor further processes the raw data that a field has extracted.
Regular Expression Extractor
Use regular expressions to extract specific patterns from text.
Configuration:
{
"type": "regex",
"code": "\\d+\\.\\d+"
}
Example:
Raw data: "Price: $199.99"
Regex: \d+\.\d+
Result: "199.99"
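The regex example can be reproduced in plain JavaScript. This sketch assumes the extractor returns the first full match and `null` when nothing matches; the function name `applyRegexExtractor` is hypothetical, and how AutoWDS treats capture groups or multiple matches is not stated in this document.

```javascript
// Sketch: applies a { type: "regex", code } extractor to a raw string.
function applyRegexExtractor(value, extractor) {
  // The "code" string is compiled as-is; in JSON each backslash must be doubled.
  const match = new RegExp(extractor.code).exec(value);
  // Assumption: the first full match is the result, null when nothing matches.
  return match ? match[0] : null;
}

console.log(applyRegexExtractor("Price: $199.99", { type: "regex", code: "\\d+\\.\\d+" }));
// prints: 199.99
```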
Sed Replacement Extractor
Use sed commands for text replacement.
Configuration:
{
"type": "sed",
"code": "s/\\$//g;s/Price: *//g"
}
Example:
Raw data: "Price: $199.99"
Sed command: s/\$//g;s/Price: *//g
Result: "199.99"
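To show what a chain of sed substitutions does step by step, here is a small JavaScript approximation. It is a sketch with clear limits: it only understands `s/pattern/replacement/flags` commands separated by `;`, it interprets patterns as JavaScript regular expressions rather than sed's own syntax, and it inserts the replacement literally (no `&` or `\1`). The function name `applySed` is hypothetical. The second command in the usage example also consumes the spaces after the colon, so no leading space is left in the result.

```javascript
// Matches one "s/pattern/replacement/flags" command; escaped slashes are allowed.
const SED_SUBSTITUTE = /^s\/((?:\\.|[^\/\\])*)\/((?:\\.|[^\/\\])*)\/([gi]*)$/;

// Sketch: runs sed-style substitutions one after another on a string.
function applySed(value, script) {
  // Naive split: a ";" inside a pattern is not supported by this sketch.
  return script.split(";").reduce((text, command) => {
    const parts = SED_SUBSTITUTE.exec(command.trim());
    if (!parts) throw new Error(`Unsupported sed command: ${command}`);
    const [, pattern, replacement, flags] = parts;
    // A function replacer inserts the replacement text literally.
    return text.replace(new RegExp(pattern, flags), () => replacement);
  }, value);
}

console.log(applySed("Price: $199.99", "s/\\$//g;s/Price: *//g"));
// prints: 199.99
```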
JavaScript Extractor
Use JavaScript code for complex processing.
Configuration:
{
"type": "js",
"code": "return value.replace(/[^0-9.]/g, '')"
}
Example:
// Clean price data
return value.replace(/[^0-9.]/g, '')
// Convert date format
return new Date(value).toISOString()
// Calculate discount
const original = parseFloat(data.originalPrice)
const current = parseFloat(value)
return ((1 - current/original) * 100).toFixed(2) + '%'
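The snippets above are function bodies that read `value` (and, in the discount example, `data`) and end with `return`. One plausible way to run such a body is to wrap it in a function, as sketched below. This is an assumption about the mechanism, not a description of AutoWDS internals, and the name `runJsExtractor` is hypothetical. Note that `new Function` executes arbitrary code, so it should only be used with trusted rules, and some environments with a strict Content Security Policy disallow it.

```javascript
// Sketch: wraps an extractor's code string in a function of (value, data) and runs it.
function runJsExtractor(code, value, data = {}) {
  // "value" is the raw field value; "data" is assumed to hold the other fields of the record.
  const extractor = new Function("value", "data", code);
  return extractor(value, data);
}

console.log(runJsExtractor("return value.replace(/[^0-9.]/g, '')", "Price: $199.99"));
// prints: 199.99
```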
Pagination
What is Pagination?
Pagination is the functionality to automatically browse multiple pages of content to collect more data.
Pagination Types
1. Click Next (click_next)
Click "Next Page" button or page number links.
Configuration:
{
"type": "click_next",
"config": {
"selector": ".next-page"
}
}
Use Cases:
- Traditional pagination navigation
- Page number links
- "Next Page" button
2. Infinite Scroll (scroll)
Scrolls to the bottom of the page to trigger loading more content.
Configuration:
{
"type": "scroll",
"config": {
"selector": "body"
}
}
Use Cases:
- Social media feeds
- Waterfall layout
- Auto-load more lists
3. Load More (load_more)
Click "Load More" button.
Configuration:
{
"type": "load_more",
"config": {
"selector": ".load-more-btn"
}
}
Use Cases:
- Lists requiring click to load
- "View More" button
- "Expand All" functionality
Steps
What are Steps?
Steps are automated sequences of operations executed on a page, such as clicking, entering text, and scrolling.
Step Types
Click
Click page element.
Configuration:
{
"type": "click",
"selectors": [["button.submit"]]
}
Double Click
Double-click page element.
Configuration:
{
"type": "doubleClick",
"selectors": [[".item"]]
}
Change (Input)
Enters text into an input field.
Configuration:
{
"type": "change",
"selectors": [["input[name='keyword']"]],
"value": "Search keyword"
}
Key Down/Up
Simulate keyboard press.
Configuration:
{
"type": "keyDown",
"key": "Enter"
}
Navigate
Jump to specified URL.
Configuration:
{
"type": "navigate",
"url": "https://example.com/page2"
}
Scroll
Scroll page.
Configuration:
{
"type": "scroll",
"x": 0,
"y": 1000
}
Step Combinations
Multiple steps can be combined into complex operation flows:
Example: Search and Collect
{
"steps": [
{
"type": "change",
"selectors": [["input.search"]],
"value": "keyword"
},
{
"type": "click",
"selectors": [["button.search-btn"]]
},
{
"type": "wait",
"timeout": 2000
}
]
}
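Because each step type expects different keys, a step list is easy to get wrong. The sketch below checks a list against the keys that appear in the examples in this section. The function name `validateSteps` is hypothetical, the `wait` entry is taken from the combined example above, and `keyUp` is assumed to mirror `keyDown` since no separate example is given.

```javascript
// Required keys per step type, taken from the examples in this section.
const REQUIRED_STEP_KEYS = new Map([
  ["click", ["selectors"]],
  ["doubleClick", ["selectors"]],
  ["change", ["selectors", "value"]],
  ["keyDown", ["key"]],
  ["keyUp", ["key"]], // assumption: same shape as keyDown
  ["navigate", ["url"]],
  ["scroll", ["x", "y"]],
  ["wait", ["timeout"]],
]);

// Sketch: returns a list of human-readable problems; an empty list means the steps look valid.
function validateSteps(steps) {
  const problems = [];
  steps.forEach((step, index) => {
    const required = REQUIRED_STEP_KEYS.get(step.type);
    if (!required) {
      problems.push(`step ${index}: unknown type "${step.type}"`);
      return;
    }
    for (const key of required) {
      if (!(key in step)) problems.push(`step ${index}: "${step.type}" is missing "${key}"`);
    }
  });
  return problems;
}

const searchSteps = [
  { type: "change", selectors: [["input.search"]], value: "keyword" },
  { type: "click", selectors: [["button.search-btn"]] },
  { type: "wait", timeout: 2000 },
];

console.log(validateSteps(searchSteps).length === 0 ? "steps look valid" : validateSteps(searchSteps));
```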
Working Principle
Collection Flow
1. Load starting page
↓
2. Execute initial steps (like login)
↓
3. Locate list container
↓
4. Extract list data
↓
5. For each data record:
- Open detail page (if needed)
- Extract detailed information
- Merge data
↓
6. Paginate (if more pages)
↓
7. Repeat steps 3-6
↓
8. Save all data
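Steps 3 to 8 of this flow boil down to two nested loops and a merge. The sketch below simulates them with in-memory data instead of a browser. Everything in it is illustrative: `collectAll`, `listPages`, `loadDetail`, and the `detailUrl` key are hypothetical names, and the way AutoWDS merges list and detail data is assumed to be a simple key-by-key merge.

```javascript
// Sketch: listPages is an array of pages, each page an array of list records.
// loadDetail(url) stands in for opening a detail page and extracting its fields.
function collectAll(listPages, loadDetail) {
  const records = [];
  for (const rows of listPages) {            // steps 3-4 and 6-7: one pass per page
    for (const row of rows) {                // step 5: for each data record
      // Open the detail page only when the record points to one.
      const detail = row.detailUrl ? loadDetail(row.detailUrl) : {};
      // Merge list data and detail data into a single record.
      records.push({ ...row, ...detail });
    }
  }
  return records;                            // step 8: hand everything over for saving
}

const samplePages = [
  [{ title: "A", detailUrl: "/a" }, { title: "B" }],
  [{ title: "C", detailUrl: "/c" }],
];
const sampleDetails = { "/a": { stock: 3 }, "/c": { stock: 0 } };

console.log(collectAll(samplePages, (url) => sampleDetails[url]));
```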
Data Flow
Web Element → Selector Location → Attribute Extraction → Data Processing → Field Value → Data Record → Export File
Execution Context
Each node has a context when executing, containing:
- Current Page: Browser page being operated
- Current Node: Node configuration being executed
- Data Path: Data accumulated from the start node up to the current node
- Log Recording: Execution process log information
Best Practices
1. Selector Writing
✅ Recommended:
/* Use stable attributes */
[data-product-id]
.product-title
#main-content
/* Appropriate hierarchy */
.product-list > .item
❌ Avoid:
/* Dynamically generated class */
.css-1a2b3c
/* Too deep hierarchy */
body > div > div > div > div > span
/* Position dependent */
div:nth-child(5)
2. Data Extraction
✅ Recommended:
- Extract raw data first and process it afterwards
- Use appropriate extractors to clean data
- Set default values for missing data
❌ Avoid:
- Extracting content that contains a lot of HTML tags
- Leaving special characters unprocessed
- Skipping data validation
3. Pagination Handling
✅ Recommended:
- Set reasonable delays between pages
- Limit the number of pages
- Check whether more data is available
❌ Avoid:
- Paginating without a limit
- Paginating so fast that the site blocks you
- Leaving pagination failures unhandled
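The pagination recommendations above can be combined in a single loop. The following sketch is a generic pattern and not AutoWDS code: `paginate`, `extractPage`, `goToNext`, `maxPages`, and `delayMs` are hypothetical names. `extractPage` is assumed to return the rows of the current page, and `goToNext` is assumed to resolve to `false` when no further page exists.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: collects rows page by page with a page limit, a delay, and failure handling.
async function paginate({ extractPage, goToNext, maxPages = 10, delayMs = 1000 }) {
  const rows = [];
  for (let page = 1; page <= maxPages; page += 1) {
    rows.push(...(await extractPage()));
    if (page === maxPages) break;            // page limit reached
    let moved = false;
    try {
      moved = await goToNext();              // false means there is no next page
    } catch (error) {
      // Keep what has been collected so far instead of losing everything.
      console.warn(`Pagination failed on page ${page}: ${error.message}`);
    }
    if (!moved) break;
    await pause(delayMs);                    // slow down to avoid being blocked
  }
  return rows;
}
```

With `click_next` pagination, `goToNext` would click the configured selector and report whether the button was found. The limit and delay values here are placeholders to tune per site.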
4. Error Handling
✅ Recommended:
- Add wait time to ensure the page has loaded
- Set timeout durations
- Record detailed error logs
❌ Avoid:
- Assuming elements always exist
- Leaving network errors unhandled
- Ignoring exceptional cases
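Waiting with a timeout, rather than assuming an element exists, can be expressed as a small polling helper. This is a generic sketch; `waitFor` and its options `timeoutMs` and `intervalMs` are hypothetical names and do not describe a built-in AutoWDS function.

```javascript
// Sketch: polls check() until it returns a truthy value or the timeout expires.
async function waitFor(check, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = check();
    if (result) return result;
    if (Date.now() >= deadline) {
      // Fail loudly with a message that can go straight into the error log.
      throw new Error(`Timed out after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a browser, `await waitFor(() => document.querySelector(".product-list"))` would resolve to the element once it appears and reject if it never does, so the caller can log the failure and continue.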
Next Steps
Now that you understand the basic concepts, you can continue learning:
- Management Interface - Learn about plugin functional modules
- Intelligent Collection - Use AI to quickly create collection rules
- Visual Rule Development - Create complex collection rules
- Tutorials - Learn data collection through examples