The renewal maintenance has officially ended for Progress iMacros effective November 30, 2023.
This Wiki site will also no longer be moderated from the Progress side.
Data Extraction
Data Extraction and Web Scraping
A key activity in web automation is the extraction of data from websites, also known as web scraping or screen scraping. Whether it is price lists, stock information, financial data or any other type of data, iMacros can extract this data for you and either re-use the data or store it in a file or database.
iMacros can write extracted data to standard text files, including the comma separated value (.csv) format, readable by spreadsheet processing packages. Also, iMacros can make use of the powerful scripting interface to save data directly to databases.
The Extract command
From version 6.0 of iMacros onward, extraction is no longer handled by a separate command. Instead, it is specified by an additional parameter to the TAG command. Please see the updated Demo-Extract macro for some examples of this, including the following:
TAG POS=1 TYPE=SPAN ATTR=CLASS:bdytxt&&TXT:* EXTRACT=HTM
This means that the syntax of the command is now the same as for the TAG command, with the type of extraction specified by the additional EXTRACT parameter.
Creation of Extraction Tags
Manual Creation in the Internet Explorer and Firefox Plug-ins
In order to manually create an extraction tag it is necessary to first record a TAG command. In record mode click on the data to be extracted. After stopping the macro recording, open the macro for editing and add the EXTRACT= parameter to the relevant TAG command.
TAG POS=1 TYPE=TD ATTR=CLASS:NewLatestResultsLotto&&TXT:* EXTRACT=TXT
Extraction Wizard
Within the iMacros Browser the Extraction Wizard can be used to automatically generate and test extractions.
To define an EXTRACT command proceed as follows:
- Whilst in record mode, open the Extraction Wizard (the "Extract Data" button on the Rec tab of the control panel).
- In the browser window or frame select the text that you want to extract.
- The marked information will be displayed in the yellowish text area on the left. iMacros also creates a suggestion for the extraction, which is displayed in the orange text field on the right.
- Click "Test EXTRACT Tag" to test run the extraction tag. The result of the generated extraction anchor will then be displayed in the yellow text area on the right hand side of the wizard. If the result is #EANF# (Extraction Anchor Not Found) you will need to alter the extraction anchor in order to successfully extract the data.
- If you are satisfied with the result click "Add this EXTRACT tag" to add the EXTRACT statement to the macro.
Beta Note: In the current beta version the POS= parameter is not automatically set to the correct value (it is always 1). You can work around this by manually testing different POS values; automatic POS value detection will be added ASAP. Alternatively, you can keep POS=1 and refine the other parts of the anchor. In the Google example you can use ATTR=TXT:*©* because this symbol does not change even when the year value changes.
Extraction from Framed Websites
If the information you want to extract is inside a framed web site you need to have a FRAME command to mark the frame as active for extraction.
When recording a TAG command the FRAME command will automatically be generated.
URL GOTO=http://www.iopus.com/imacros/demo/v5/frames/index.htm
FRAME F=5
TAG POS=1 TYPE=P ATTR=TXT:<SP>Frame5
Within the Extraction Wizard, when selecting the data to be extracted the FRAME command will automatically be generated.
URL GOTO=http://www.iopus.com/imacros/demo/v5/frames/index.htm
FRAME F=3
TAG POS=1 TYPE=P ATTR=TXT:* EXTRACT=TXT
Test Popup
When manually running a macro with an extraction TAG, by default the extraction will be displayed on the screen. This facility can be switched off using the following command:
SET !EXTRACT_TEST_POPUP NO
Handling Extraction Results
SAVEAS
You can save extracted data directly to a file by manually adding a "SAVEAS TYPE=EXTRACT" command to the macro. All items that were extracted before the SAVEAS command are saved to the specified file in one row, like this:
"item1", "item2", "item 3", ...
As you can see the [EXTRACT] tags, which are inserted to distinguish results from different EXTRACT commands, are substituted by commas. The SAVEAS command erases the content of the !EXTRACT variable afterwards. With the next start of the macro or the next round of a loop a new line is added to the file.
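If you want to post-process such a file outside of iMacros, the following Python sketch shows one possible way to read the rows back in. It assumes the file uses quoted items separated by commas exactly as in the sample row above; the function name parse_extract_rows and the sample data are made up for illustration.

```python
import csv
import io


def parse_extract_rows(text):
    """Parse rows such as '"item1", "item2"' into lists of strings.

    Assumption: every macro run appended one comma-separated row of
    quoted items, as described for SAVEAS TYPE=EXTRACT.
    """
    # skipinitialspace tolerates a blank after each comma, as in the sample row
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # csv.reader yields an empty list for blank lines; skip those
    return [row for row in reader if row]


# Two hypothetical rows, i.e. the result of two macro runs
sample = '"item1", "item2", "item 3"\n"a", "b", "c"\n'
rows = parse_extract_rows(sample)
```

Each inner list then corresponds to one run of the macro (or one round of a loop).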
Extraction & the Scripting Interface
(Related example scripts: Extract-and-fill.vbs, Extract-2-file.vbs, Get-Exchange-Rate.vbs)
All extracted data can be sent to your code via the Scripting Interface. This gives you all the power of any programming language you choose to process the extracted information further or simply save it to a file.
Use the iimGetLastExtract command to return the extracted information from the macro.
The extracted text is returned as a string. The results of different extractions are separated by [EXTRACT], e.g.
Text to be extracted[EXTRACT] Salary: 33,000.00 per year[EXTRACT]...
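As an illustration, a calling program written in Python could break such a string into its individual results roughly as follows. This is only a sketch: split_extracts is a hypothetical helper, and trimming the whitespace around each item is an assumption made here, not documented iMacros behavior.

```python
EXTRACT_SEPARATOR = "[EXTRACT]"


def split_extracts(raw):
    """Split a combined extraction string into its individual results."""
    parts = raw.split(EXTRACT_SEPARATOR)
    # A string that ends with the separator produces an empty last item
    if parts and parts[-1] == "":
        parts.pop()
    # Assumption: surrounding whitespace is not significant
    return [part.strip() for part in parts]


items = split_extracts(
    "Text to be extracted[EXTRACT] Salary: 33,000.00 per year[EXTRACT]"
)
```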
Remember: Using the "SAVEAS TYPE=EXTRACT" command will reset the contents of the !EXTRACT variable. Thus, using this command in a macro whose extraction result you wish to obtain via the Scripting Interface will result in an empty string in your application!
If you extract a complete table the data from different columns is separated by #NEXT# and each table row ends with #NEWLINE#. You can easily use the separation tags to split the complete dataset. In Visual Basic Script this would, for example, look something like:
s = Replace(s, "#NEWLINE#", """" + vbCrLf + """")
s = Replace(s, "#NEXT#", """" + "," + """")
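The same idea can be sketched in Python for readers who process the extracted string there. The helper names parse_table and table_to_csv are hypothetical, and the sample string is invented; only the #NEXT# and #NEWLINE# separators come from the description above.

```python
import csv
import io


def parse_table(raw):
    """Turn an extracted table string into a list of rows (lists of cells)."""
    rows = []
    for line in raw.split("#NEWLINE#"):
        if line == "":
            # The text after the final #NEWLINE# is empty; ignore it
            continue
        rows.append(line.split("#NEXT#"))
    return rows


def table_to_csv(rows):
    """Render the rows as CSV text with every cell quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


table = parse_table("a#NEXT#b#NEWLINE#c#NEXT#d#NEWLINE#")
csv_text = table_to_csv(table)
```

Using the csv module instead of plain string replacement has the side benefit that cells containing commas or quotes are escaped correctly.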
Example 1 - Transfer extracted values to calling program
Use iimGetLastExtract to retrieve the values.
iplay = iim1.iimPlay("wsh-extract-rate")
If iplay = 1 Then
    s = "One US$ costs " + iim1.iimGetLastExtract(1) + " EURO or " + iim1.iimGetLastExtract(2) + " British Pounds (GBP)"
Else
    s = "The following error occurred: " + iim1.iimGetLastError()
End If
Unsuccessful Extraction
If the extraction was unsuccessful, i.e. the extraction anchor could not be found on the page, the !EXTRACT variable holds the string #EANF# (Extraction Anchor Not Found). However, the return value that tells you whether the macro ran successfully is still positive (iimPlay = 1). The reason for this behavior is that a macro can contain many TAG...EXTRACT commands, and often only one or a few of them fail to find their extraction anchor. To check whether a particular EXTRACT command was successful, check whether #EANF# is present in the returned string. This can be very useful, for example if you use EXTRACT to check whether a keyword is present on a page: a returned string containing #EANF# indicates that the keyword was not found. For comparison, if a standard TAG command cannot locate the defined element, then iMacros returns an error.
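A calling program might perform this check along the following lines. This Python sketch works on the combined extraction string only; the helper names are hypothetical, and the 1-based numbering is chosen here to mirror the iimGetLastExtract(1), iimGetLastExtract(2) calls in Example 1.

```python
EANF = "#EANF#"
EXTRACT_SEPARATOR = "[EXTRACT]"


def failed_extractions(raw):
    """Return the 1-based positions of extractions whose anchor was not found."""
    parts = raw.split(EXTRACT_SEPARATOR)
    return [index for index, part in enumerate(parts, start=1) if EANF in part]


def keyword_present(raw):
    """Treat a single keyword extraction as found unless it contains #EANF#."""
    return EANF not in raw


# Hypothetical result of a macro with three TAG...EXTRACT commands
failed = failed_extractions("10.5[EXTRACT]#EANF#[EXTRACT]ok")
```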
Extraction of Dialog Text
To get the text of a dialog use
SET !EXTRACTDIALOG YES
in the macro at any position before the dialog appears. The content of the dialog is then added to the extracted text, i.e. to the !EXTRACT variable.
Example:
URL GOTO=http://www.iopus.com/imacros/demo/v6/dialogs/javascript2.htm
SET !EXTRACTDIALOG YES
ONDIALOG POS=1 BUTTON=OK CONTENT=
TAG POS=1 TYPE=INPUT:BUTTON FORM=NAME:NoFormName ATTR=VALUE:Popup<SP>1
WAIT SECONDS=3
PROMPT {{!EXTRACT}}
The PROMPT command in this example is simply used to show the extracted values. The WAIT statement is not strictly required, but there has to be a delay of 1-2 seconds between the time you trigger the dialog and the first time you use the extracted dialog text. The reason is that there is a small delay between the time the TAG command triggers the dialog (e.g. by clicking on a link) and the time the dialog actually appears. iMacros has no way of knowing beforehand that a certain link will trigger a dialog, so it has to "catch" the dialog once it appears and then handle it. Typically this whole process is fast and takes less than a second, but until it is complete the !EXTRACT variable is not filled with the text from the dialog.
Extracting From SELECT Elements
In HTML code drop down lists are generated by a SELECT tag. For SELECT boxes the currently active value is extracted.
Select currently active values:
TAG POS=1 TYPE=SELECT ATTR=TXT:*&&NAME:quantity&&VALUE:* EXTRACT=TXT
Extraction and the PRE Tag
Some web pages make use of a <PRE ...> tag in their HTML code. It marks the enclosed text as preformatted -- all the spaces and carriage returns are rendered exactly as you type them. The information enclosed in a <PRE> tag is extracted correctly (including the formatting!) by iMacros. Thus, if you transfer the extracted data via the Scripting Interface all formatting information is retained unchanged. The formatting is only changed on two occasions: line breaks are removed when displaying the result in the test dialog box and when saving the result using the SAVEAS command. This is necessary to ensure proper formatting of the CSV formatted text file because in the CSV format a line break would start a new line.
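If you receive preformatted text via the Scripting Interface and want to write it to a CSV file yourself, you may want to flatten it in a similar way. The Python sketch below shows one possible approach; replacing each line break with a single space is an assumption made for this example and is not necessarily what iMacros does internally.

```python
def flatten_line_breaks(text):
    """Join all lines of a preformatted block into one line.

    Assumption: each line break becomes one space; other whitespace
    (such as indentation) is left untouched.
    """
    return " ".join(text.splitlines())


flat = flatten_line_breaks("line 1\n  line 2\r\nline 3")
```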
Extract with relative Positioning
(Related example macro: Demo-ExtractRelative )
When extracting data from complex websites, the extraction can be made easier if you tell iMacros to start the search for the extraction anchor after a specific point on the page (as opposed to starting from the top, which is the default).
E.g., assume you want to extract data from a specific cell in a table, in this case the size of the land in the second table.
Without relative positioning you would have to count the cell from the top of the page, including cells from other tables that come before the land table. Although the Extraction Wizard can do this for you, you run into problems as soon as the number of rows in a table is not constant, as is the case in the above example: the Transfer table of result 1 has four rows, that of result 2 has five rows. Thus, an absolute position parameter like so
TAG POS=1 TYPE=TD ATTR=CLASS:code&&TXT:* EXTRACT=TXT
will potentially result in the extraction of an unwanted result.
With relative positioning you tell iMacros to search for the extraction anchor located after the position that is indicated by a TAG command immediately before your TAG...EXTRACT command. In our case we click on the table title "Land" before starting the extraction wizard to create a TAG command. Note that this TAG command does not click on any link, rather it only marks an element to indicate a position for the following EXTRACT command. Relative positions are indicated with an R before the position number.
TAG POS=1 TYPE=B ATTR=TXT:Land
TAG POS=R1 TYPE=TD ATTR=CLASS:code&&TXT:* EXTRACT=TXT
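The difference between absolute and relative positions can be illustrated with a small Python model. This is a deliberately simplified sketch, not a description of how iMacros is implemented: the page is reduced to a flat list of (tag, text) pairs, and the function names and sample data are invented for the example.

```python
def find_absolute(elements, tag, pos):
    """Return the pos-th element (1-based) with the given tag, counted from the top."""
    matches = [element for element in elements if element[0] == tag]
    if 1 <= pos <= len(matches):
        return matches[pos - 1]
    return None  # comparable to an anchor that is not found


def find_relative(elements, anchor, tag, pos):
    """Return the pos-th element with the given tag that follows the anchor."""
    start = elements.index(anchor) + 1
    return find_absolute(elements[start:], tag, pos)


# Hypothetical page: a Transfer table of varying length, then the Land table
page = [
    ("B", "Transfer"), ("TD", "t1"), ("TD", "t2"),
    ("B", "Land"), ("TD", "500 sqm"), ("TD", "other"),
]

absolute_hit = find_absolute(page, "TD", 1)
relative_hit = find_relative(page, ("B", "Land"), "TD", 1)
```

In this model the absolute lookup returns the first cell of the Transfer table, while the relative lookup returns the first cell after the "Land" title regardless of how many rows precede it.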
Related forum posts:
- Video Tutorial Relative Extraction
- Three fundamental techniques of extracting a table's data
- Extract Number of Google Search Results
- More Robust Extraction Tags
- Extract a table line by line
Image Extraction
Please see the Save Web Page Elements chapter.
Asian Language Support
iMacros runs on all language versions of Windows, including the so-called "double-byte" languages like Chinese, Japanese or Korean.
Data Extraction:
Western (ANSI) characters can be extracted on any language version of Windows. In order to extract Asian characters correctly please run iMacros on a Windows system that supports the language. Example: To extract Chinese characters please run iMacros on the Chinese language version of Windows.