The renewal maintenance has officially ended for Progress iMacros effective November 30, 2023.
This Wiki site will also no longer be moderated from the Progress side.
Difference between revisions of "Data Extraction"
Revision as of 07:27, 13 October 2017
Data Extraction and Web Scraping
A key activity in web automation is the extraction of data from websites, also known as web scraping or screen scraping. Whether it is price lists, stock information, financial data or any other type of data, iMacros can extract this data for you and either re-use the data or store it in a file or database.
iMacros can write extracted data to standard text files, including the comma separated value (.csv) format, readable by spreadsheet processing packages. Also, iMacros can make use of the powerful scripting interface to save data directly to databases.
The Extract command
Data extraction is specified by an EXTRACT parameter in the TAG command. This parameter replaces the usual CONTENT parameter. Please see the updated Demo-Extract for some examples of this, including the following:
TAG POS=1 TYPE=SPAN ATTR=CLASS:bdytxt&&TXT:* EXTRACT=HTM
This means that the syntax of the command is now the same as for the TAG command, with the type of extraction specified by the additional EXTRACT parameter.
Creation of Extraction Tags
Extraction Wizard
The Extraction Wizard can be used to automatically generate and test extractions.
To define an EXTRACT command proceed as follows:
- Whilst in record mode, open the Text Extraction Wizard ("Text" button on the Rec tab).
- In the browser window or frame select the text that you want to extract.
- Choose what type of extraction you want to perform on that element, like TXT, HTM, HREF, ALT, TXTALL, or TITLE. Not all types are available for all elements.
- The extracted information will be displayed in the wizard. iMacros also creates a suggestion for the tag command attribute and position.
- If the result is #EANF# (Extraction Anchor Not Found) you will need to alter the extraction anchor in order to successfully extract the data.
- If you are satisfied with the result click "Add Command" to add a TAG command with the EXTRACT statement to the macro.
Table extract commands can be easily produced and checked using the Text Extract Wizard. If the element chosen is a table, the table data is properly formatted and displayed in the wizard.
Note: The Extraction Wizard is only available in the iMacros Browser and iMacros for Internet Explorer, but the generated commands can be used in all iMacros versions.
Extraction from Framed Websites
If the information you want to extract is inside a framed web site you need to have a FRAME command to mark the frame as active for extraction.
When recording a TAG command the FRAME command will automatically be generated.
URL GOTO=http://demo.imacros.net/Automate/Frames
FRAME F=5
TAG POS=1 TYPE=P ATTR=TXT:<SP>Frame5
Within the Extraction Wizard, when selecting the data to be extracted the FRAME command will automatically be generated.
URL GOTO=http://demo.imacros.net/Automate/Frames
FRAME F=3
TAG POS=1 TYPE=P ATTR=TXT:* EXTRACT=TXT
Manual Creation
In order to manually create an extraction tag it is necessary to first record a TAG command. In record mode click on the data to be extracted. After stopping the macro recording, open the macro for editing and replace the CONTENT= attribute with the EXTRACT=TXT parameter (or just simply add the EXTRACT parameter to the end of command if the CONTENT parameter does not exist). If you need to extract other information than text you can use the TXTALL, HTM, HREF, TITLE, ALT or CHECKED attribute instead of TXT.
TAG POS=1 TYPE=TD ATTR=CLASS:NewLatestResultsLotto&&TXT:* EXTRACT=TXT
Starting with iMacros 10.2, you can extract any attribute. If you want to know which attributes are available, record the TAG command in Expert Mode.
Example:
Normally, the tooltips are specified by the TITLE attribute of an element. You can extract the tooltip content with iMacros:
The qTip tooltip plugin for the jQuery JavaScript framework does not use the title attribute but a custom attribute named "data-qtip":
ComputerName=* is the initial immutable part of the tooltip content and is used to specify which tooltip should be extracted. Recording in Expert Mode will create a TAG command that consists of all attributes of the clicked HTML element.
Extract Complete Website
To extract a complete web page (or the complete header or body) you need to manually insert the appropriate TAG line. Please see the examples:
URL GOTO=http://www.imacros.net
'Complete Page
TAG POS=1 TYPE=HTML ATTR=* EXTRACT=HTM
'Complete Page TEXT only
TAG POS=1 TYPE=HTML ATTR=* EXTRACT=TXT
'Page header only
TAG POS=1 TYPE=HEAD ATTR=* EXTRACT=HTM
'Page body
TAG POS=1 TYPE=BODY ATTR=* EXTRACT=HTM
Alternatively you can use the SAVEAS command to save the complete web page.
For an example using the SEARCH command, please see the following forum post: Most Efficient Way To Extract Source Code (http://forum.imacros.net/viewtopic.php?f=7&t=11200)
Extract Table
Use TAG TYPE=TABLE ... to extract the content of a complete table with one command. Example: Demo-Extract-Table.
This method works well with well-formatted tables. For trickier table extractions you always have the option to extract them cell by cell as shown in the !ENDOFPAGE example.
Table cells in the extracted data are separated by the string #NEXT# and table rows are delimited by the string #NEWLINE#. These tags are automatically translated into commas and newlines when you use the "SAVEAS TYPE=EXTRACT" command, but the delimiters are retained when returning the data to a script via a call to iimGetExtract.
If the table you are attempting to extract also contains nested tables, then the inner table data will also be separated by commas and new lines (in CSV format, via SAVEAS), or #NEXT# and #NEWLINE# (via iimGetExtract).
You can use the Text Extraction Wizard to see the resulting extracted table, but in this case, for visual simplicity, the inner tables are shown as plain text, without the delimiters.
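The delimiter scheme described above can be illustrated with a short sketch. It assumes the raw table string has already been returned to a script through the Scripting Interface; Python and the helper name parse_extracted_table are choices made for this illustration only and are not part of iMacros.

```python
# Delimiters that iMacros places in a table extraction, as described above.
NEXT = "#NEXT#"        # separates cells within a row
NEWLINE = "#NEWLINE#"  # ends a table row


def parse_extracted_table(raw):
    """Split a raw table extraction into a list of rows (lists of cell strings).

    Hypothetical helper for illustration; 'raw' is assumed to be the string
    handed back by the Scripting Interface for a TAG TYPE=TABLE extraction.
    """
    pieces = raw.split(NEWLINE)
    # A trailing #NEWLINE# leaves one empty piece at the end; drop only that one.
    if pieces and pieces[-1] == "":
        pieces.pop()
    return [piece.split(NEXT) for piece in pieces]


if __name__ == "__main__":
    sample = "Name#NEXT#Price#NEWLINE#Apple#NEXT#1.20#NEWLINE#Pear#NEXT#0.95#NEWLINE#"
    for row in parse_extracted_table(sample):
        print(row)
```

Because nested tables use the same delimiters, their cells simply appear as additional columns and rows in the result; the sketch makes no attempt to tell inner and outer tables apart.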
Extract Page Title
To extract a title of a website you need to manually insert the appropriate TAG line with TYPE=TITLE. This TAG command finds the page's title element. Please see the example:
URL GOTO=http://www.imacros.net/
TAG POS=1 TYPE=TITLE ATTR=* EXTRACT=TXT
Extract Page URL
To extract the URL of a website as shown in the browser address bar please use the built-in !URLCURRENT variable and store this value in !EXTRACT with the SET or ADD command.
ADD !EXTRACT {{!URLCURRENT}}
Test Popup
When manually running a macro with an extraction TAG, by default the extraction will be displayed on the screen. This facility can be switched off using the following command:
SET !EXTRACT_TEST_POPUP NO
Handling Extraction Results
If a macro contains several EXTRACT commands, the results are separated by the string [EXTRACT].
SAVEAS
You can save extracted data directly to a CSV file by adding a "SAVEAS TYPE=EXTRACT" command manually to the macro. All items that were extracted before the SAVEAS command are saved to the specified file in one row like
"item1", "item2", "item 3", ...
As you can see the [EXTRACT] tags, which are inserted to distinguish results from different EXTRACT commands, are substituted by commas. If in the Options dialog you have checked "Use regional settings in CSV files", the "comma" between each extraction is going to be your system list separator (a semi-colon ";" for instance) instead of ",".
The SAVEAS command erases the content of the !EXTRACT variable afterward. With the next start of the macro or the next round of a loop a new line is added to the file.
Saving arbitrary values
In addition to extracting data from web pages, you can also save any arbitrary value to your output file by adding it to the extraction buffer with the following syntax:
ADD !EXTRACT <value>
Where <value> is replaced with the actual value you want to save (can also be an EVAL expression). For example, the following will add a time stamp column to the output:
ADD !EXTRACT {{!NOW:yyyy/mm/dd_hhnn}}
Each time you add something to !EXTRACT either by using the ADD command or as part of a TAG...EXTRACT command, it will save that value as a separate column when using SAVEAS TYPE=EXTRACT.
Extraction & the Scripting Interface
(Related example scripts: Extract-and-fill.vbs, Extract-2-file.vbs, Get-Exchange-Rate.vbs)
All extracted data can be sent to your code via the Scripting Interface. This gives you all the power of any programming language you choose to process the extracted information further or simply save it to a file.
Use the iimGetLastExtract command to return the extracted information from the macro.
The extracted text is returned as a string. Results from different extractions are separated by [EXTRACT], e.g.
Text to be extracted[EXTRACT] Salary: 33,000.00 per year[EXTRACT]...
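As a minimal sketch of handling such a string in the calling program, the following splits it into individual results. It assumes the combined string is already available in the script; Python and the function name split_extractions are illustrative assumptions, not an iMacros API.

```python
EXTRACT_SEP = "[EXTRACT]"  # separator between results of different EXTRACT commands


def split_extractions(raw):
    """Return the individual extraction results contained in 'raw'.

    Hypothetical helper: 'raw' is assumed to be the combined extraction
    string returned through the Scripting Interface.
    """
    parts = raw.split(EXTRACT_SEP)
    # If the string ends with the separator, the last piece is empty; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


if __name__ == "__main__":
    raw = "Text to be extracted[EXTRACT]Salary: 33,000.00 per year[EXTRACT]"
    for number, value in enumerate(split_extractions(raw), start=1):
        print(number, value)
```

The values are deliberately not trimmed, so any whitespace or formatting inside an extraction is passed on unchanged.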
Remember: Using the "SAVEAS TYPE=EXTRACT" command will reset the contents of the !EXTRACT variable. Thus, using this command in a macro whose extraction result you wish to obtain via the Scripting Interface will result in an empty string in your application!
If you extract a complete table the data from different columns is separated by #NEXT# and each table row ends with #NEWLINE#. You can easily use the separation tags to split the complete dataset. In Visual Basic Script this would, for example, look something like:
s = Replace(s, "#NEWLINE#", """" + vbCrLf + """")
s = Replace(s, "#NEXT#", """"+ "," + """")
Related forum post: Missing #NEXT# delimiters in .csv from web extraction (http://forum.imacros.net/viewtopic.php?f=7&t=12375&p=36478#p36478)
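The same conversion can be sketched outside VBScript. The example below is an assumption-laden illustration in Python (the function name table_to_csv is hypothetical): it turns a table extraction into quoted CSV text with the standard csv module, which also escapes quotation marks inside cell values, something a plain string replacement does not do.

```python
import csv
import io


def table_to_csv(raw, delimiter=","):
    """Convert a #NEXT#/#NEWLINE#-delimited extraction into quoted CSV text.

    Hypothetical helper for illustration. Empty pieces, including the one
    after a trailing #NEWLINE#, are dropped.
    """
    rows = [line.split("#NEXT#") for line in raw.split("#NEWLINE#") if line]
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,     # e.g. ";" to mimic a regional list separator
        quoting=csv.QUOTE_ALL,   # quote every field, like "item1","item2"
        lineterminator="\r\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "Name#NEXT#Price#NEWLINE#Apple#NEXT#1.20#NEWLINE#"
    print(table_to_csv(sample))
    print(table_to_csv(sample, delimiter=";"))
```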
Example 1 - Transfer extracted values to calling program
Use iimGetLastExtract to retrieve the values.
iplay = iim1.iimPlay("wsh-extract-rate")
If iplay = 1 Then
    s = "One US$ costs " + iim1.iimGetLastExtract(1) + " EURO or " + iim1.iimGetLastExtract(2) + " British Pounds (GBP)"
Else
    s = "The following error occurred: " + iim1.iimGetLastError()
End If
Unsuccessful Extraction
If the extraction was unsuccessful, i.e. the extraction anchor could not be found on the page, the !EXTRACT variable holds the string #EANF# (Extraction Anchor Not Found). However, the return value that informs you whether the execution of a macro was successful is still positive (iimPlay = 1). The reason for this behavior is that a macro can have many TAG...EXTRACT commands and often only one or a few of them do not find the extraction anchor. If you want to check if a particular EXTRACT command was successful you just need to check if #EANF# is present in the returned string. Often this can be very useful, for example if you use EXTRACT to check if a keyword is present on a page. A returned string containing #EANF# indicates that the keyword is not found. For comparison, if a standard TAG command cannot locate the defined element then iMacros returns an error.
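Since a successful macro run does not guarantee that every extraction found its anchor, the calling program has to inspect the values itself. The following sketch assumes the individual extraction values have already been collected into a list; Python and the name failed_extractions are assumptions for this illustration.

```python
EANF = "#EANF#"  # marker for "Extraction Anchor Not Found"


def failed_extractions(values):
    """Return the 1-based positions of extractions whose anchor was not found.

    Hypothetical helper. The comparison ignores case, so a lower-case
    "#eanf#" is treated the same way as "#EANF#".
    """
    return [
        position
        for position, value in enumerate(values, start=1)
        if EANF in value.upper()
    ]


if __name__ == "__main__":
    values = ["0.92", "#EANF#", "0.79"]
    missing = failed_extractions(values)
    if missing:
        print("No anchor found for extraction(s):", missing)
    else:
        print("All extractions succeeded")
```

Positions are counted from 1 so that they line up with the index passed to iimGetLastExtract in Example 1.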
Image Extraction
The Image Extraction Wizard helps you create the right TAG ... CONTENT=EVENT:SAVEITEM commands for image web scraping. Please see the Save Web Page Elements chapter for more details.
Extraction of Dialog Text
To get the text of a dialog use
SET !EXTRACTDIALOG YES
in the macro at any position before the dialog appears. Now the content of a dialog is added to the extracted text, i.e. to the !EXTRACT variable.
Example:
URL GOTO=http://demo.imacros.net/Automate/Dialogs
SET !EXTRACTDIALOG YES
ONDIALOG POS=1 BUTTON=OK CONTENT=
TAG POS=1 TYPE=INPUT:BUTTON FORM=NAME:NoFormName ATTR=VALUE:Popup<SP>1
WAIT SECONDS=3
PROMPT {{!EXTRACT}}
The PROMPT command in this example is simply used to show the extracted values. The WAIT statement is not strictly required, but there has to be a delay of 1-2 seconds between the time you trigger the dialog and the first time you use the extracted dialog text. The reason for this is that there is a small delay between the time the TAG command triggers the dialog (e.g. by clicking on a link) and the time the dialog actually appears. iMacros has no way of knowing beforehand that a certain link will trigger a dialog, so it has to "catch" the dialog once it appears and then handle it. Typically this whole process is fast and takes less than a second, but until it is complete the !EXTRACT variable is not filled with the text from the dialog.
See also !EXTRACTDIALOG.
Extracting From SELECT Elements
In HTML code drop down lists are generated by a SELECT tag.
With a simple EXTRACT=TXT, the currently active value is extracted:
TAG POS=1 TYPE=SELECT ATTR=TXT:*&&NAME:quantity&&VALUE:* EXTRACT=TXT
In order to extract all options in a drop down list use
TAG POS=1 TYPE=SELECT ATTR=TXT:*&&NAME:quantity&&VALUE:* EXTRACT=TXTALL
!EXTRACT will then contain the complete list of entries, separated by the keyword [OPTION].
Related forum posts: How to extract value from option tag instead of text (http://forum.imacros.net/viewtopic.php?f=13&t=12982)
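A script that receives the TXTALL result can break it into the individual drop-down entries. This is a small sketch under the assumption that the string is already available in the script; Python and the name split_options are illustrative and not part of iMacros.

```python
OPTION_SEP = "[OPTION]"  # keyword separating the entries of a SELECT element


def split_options(raw):
    """Return the drop-down entries contained in a TXTALL extraction.

    Hypothetical helper. Empty pieces are dropped, so a leading or trailing
    [OPTION] keyword does not produce blank entries.
    """
    return [entry for entry in raw.split(OPTION_SEP) if entry != ""]


if __name__ == "__main__":
    for entry in split_options("1[OPTION]2[OPTION]5[OPTION]10"):
        print(entry)
```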
Extraction and the PRE Tag
Some web pages make use of a <PRE ...> tag in their HTML code. It marks the enclosed text as preformatted -- all the spaces and carriage returns are rendered exactly as you type them. The information enclosed in a <PRE> tag is extracted correctly (including the formatting!) by iMacros. Thus, if you transfer the extracted data via the Scripting Interface all formatting information is retained unchanged. The formatting is only changed on two occasions: line breaks are removed when displaying the result in the test dialog box and when saving the result using the SAVEAS command. This is necessary to ensure proper formatting of the CSV formatted text file because in the CSV format a line break would start a new line.
Extract with relative Positioning
(Related example macro: Demo-ExtractRelative )
Note: For changes in the upcoming iMacros V7 please see V7_Relative_positioning. In a nutshell, the principle stays the same, but the position is now relative to the end (close tag) of the anchor element, so iMacros V7 and iMacros for Firefox extract macros are now compatible.
When extracting data from a complex website the extraction can be made easier if you can tell iMacros to start the search for the extraction anchor after a specific point on the page (as opposed to start from the top, which is the default).
E.g., assume you want to extract data from a specific cell in a table, in this case the size of the land in the second table.
Without relative positioning you would have to count the cell from the top of the page, including cells from other tables that come before the land table. Although the extraction wizard can do this for you, you run into problems as soon as the number of rows in a table is not constant, as in the example above: the Transfer table of result 1 has four rows, while that of result 2 has five rows. Thus, an absolute position parameter such as
TAG POS=1 TYPE=TD ATTR=CLASS:code&&TXT:* EXTRACT=TXT
will potentially result in the extraction of an unwanted result.
With relative positioning you tell iMacros to search for the extraction anchor located after the position that is indicated by a TAG command immediately before your TAG...EXTRACT command. In our case we click on the table title "Land" before starting the extraction wizard to create a TAG command. Note that this TAG command does not click on any link, rather it only marks an element to indicate a position for the following TAG command. Relative positions are indicated with an R before the position number.
TAG POS=1 TYPE=B ATTR=TXT:Land
TAG POS=R1 TYPE=TD ATTR=CLASS:code&&TXT:* EXTRACT=TXT
- If you want to use a button or a link as reference, you should tag it with TAG ... EXTRACT=TXT, to avoid following the link or "pushing" the button. In that case, do not forget to use SET !EXTRACT NULL, to clear the extract variable before the real extract.
How to limit the extraction search range
Use !ENDOFPAGE to limit the extraction to a range above a certain trigger word or image.
Backwards relative positions: Since iMacros V6.20 you can also indicate backward positions (= to the left and/or top of a selected element). Negative relative extraction supports up to 10 backward steps (POS=R-10).
'Negative positioning => move to the LEFT and/or TOP of the anchor
TAG POS=1 TYPE=TD ATTR=TXT:31023G20080822
TAG POS=R-1 TYPE=INPUT:CHECKBOX FORM=NAME:DataDownloadActionForm ATTR=NAME:* CONTENT=YES
How to skip a missing value
If you use relative extraction and a certain data record (e. g. a phone number) is missing on a page, then the macro would normally stop with a TAG error as the TAG for the anchor fails. But that is not what you want during an extraction: You simply want the macro to continue and extract all other values that exist. => Solution: Add SET !ERRORIGNORE YES.
Note that when the anchor TAG immediately before a relative extraction fails, then the extraction itself also fails (= returns #eanf#). This is by design to make sure that iMacros extracts only the intended value (if the extraction anchor exists) or no value ("#eanf#") if the extraction anchor is not found.
Example:
URL GOTO=http://demo.imacros.net/Automate/Extract2
SET !ERRORIGNORE YES
'Correct: TAG POS=1 TYPE=DIV ATTR=TXT:MyTable
TAG POS=1 TYPE=DIV ATTR=TXT:MyTableOTHERNAME
TAG POS=R3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
OTHERNAME was added to the TXT:MyTable attribute to trigger the extraction anchor failure for demo purposes.
Related forum posts
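- Video Tutorial Relative Extraction (http://forum.imacros.net/viewtopic.php?t=2219)
- Three fundamental techniques of extracting a table's data (http://forum.imacros.net/viewtopic.php?t=3615)
- Extract Number of Google Search Results (http://forum.imacros.net/viewtopic.php?f=13&t=6324)
- More Robust Extraction Tags (http://forum.imacros.net/viewtopic.php?t=287)
- Extract a table line by line (http://forum.imacros.net/viewtopic.php?t=153)
- Extracting flight prices from Expedia (http://forum.imacros.net/viewtopic.php?f=7&t=5661&p=15799#p15799)
- Extract and parse HTML if elements are separated by <BR> only (http://forum.imacros.net/viewtopic.php?f=2&t=5793)
- Nested elements: When does the search start? (http://forum.imacros.net/viewtopic.php?f=11&t=5881&p=16530#p16530)
- Extracting data from Amazon.com (http://forum.imacros.net/viewtopic.php?f=11&t=5474)
- Finding anchors (http://forum.imacros.net/viewtopic.php?f=7&t=5987)
- How to mark and remove SPAM from web helpdesk (http://forum.imacros.net/viewtopic.php?f=15&t=6078)
- How to extract a certain word in paragraph? (http://forum.imacros.net/viewtopic.php?f=7&t=6223)
- How to click on the last element on a page? (http://forum.imacros.net/viewtopic.php?f=11&t=6774&p=19323#p19323)
- Extracting nested tables (http://forum.imacros.net/viewtopic.php?f=13&t=15565)
- Yellow Pages example (http://forum.imacros.net/viewtopic.php?f=7&t=10390&start=15#p33539)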
Asian Language Support
iMacros runs on all language versions of Windows, including the so-called "double-byte" languages like Chinese, Japanese or Korean.
Asian Languages Text Extraction:
iMacros and the Scripting Interface include full Unicode support, so you can extract Asian language characters (e.g. Japanese) even on Western Windows versions (e.g. English).
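For more details please see these forum posts:
- Solution if extraction fails on certain character encodings (http://forum.imacros.net/viewtopic.php?t=2687)
- What is Unicode, ASCII, and ANSI? (http://forum.imacros.net/viewtopic.php?t=2783)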