While preparing the demo for my presentation tomorrow at the UKOUG Tech14 conference, I noticed that my earlier posts lacked some info. For instance, I never gave the data needed for the scraper configuration table, which actually took me some time to figure out again. I also noticed a while ago that Wikipedia switched from HTTP to HTTPS, which added an extra hurdle to getting it working again.
So here is a bit of a rewrite of the code, the content needed for the configuration table, and additional info to get it up and running if you want to try it out at home. This post also contains a link to an Oracle dumpfile that can be imported via IMPDP to load the whole demo environment in one go, including some sample data I gathered today from my hotel room (just in case WiFi will not work at the ICC conference center in Birmingham).
Gather data via HTTPS
Luckily Tim Hall came to the rescue and described the solution neatly (as always), so you only have to follow his post to:
- get the certificates needed
- create an Oracle wallet
- add the certificates to the wallet
The needed info can be found in Tim’s posts:
Follow them to the letter and you will be fine. The HTTPURITYPE calls and/or other UTL_xxxx calls to resources outside the database will then work for HTTPS data traffic.
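Once the wallet is in place, a quick sanity check could look like the sketch below. The wallet path, wallet password, certificate file and the exact orapki commands are just examples, so use the values you ended up with while following Tim’s instructions:
--
REM At OS level (example paths/password), create the wallet and add the
REM trusted root certificate downloaded for en.wikipedia.org:
REM   orapki wallet create -wallet /u01/wallet -pwd WalletPasswd123 -auto_login
REM   orapki wallet add -wallet /u01/wallet -trusted_cert -cert /tmp/wiki_root.cer -pwd WalletPasswd123
--
REM In the database session, point UTL_HTTP (and thereby HTTPURITYPE) to the wallet
EXEC UTL_HTTP.SET_WALLET('file:/u01/wallet', 'WalletPasswd123');
--
REM Test the HTTPS call against the Wikipedia API
SET DEFINE OFF;
SELECT HTTPURITYPE('https://en.wikipedia.org/w/api.php?action=query&format=json&list=allimages&ailimit=1').getclob()
FROM   dual;
--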
JSON_SCRAPER_CONFIG
The following insert of one row is needed to grab the JSON content from Wikipedia via HTTPS. See also the SQL script added at the end of this post, in case WordPress scrambles all the ampersand characters, etc. (and/or use the dumpfile to import all content).
--
REM INSERTING into JSON_WIKI.JSON_SCRAPER_CONFIG
SET DEFINE OFF;
--
INSERT INTO JSON_WIKI.JSON_SCRAPER_CONFIG
(URL_JSON_API
,URL_PARAMETER_STATIC
,TARGET_TABLE
,TARGET_TABLE_SCHEMA
,STARTED
,URL_PARAMETER_DYNAMIC)
VALUES
('https://en.wikipedia.org/w/api.php'
,'?rawcontinue&format=json&utf8&action=query&list=allimages&ailimit=1&aiprop=dimensions|mime|url|user|comment|timestamp|canonicaltitle|mediatype|metadata|commonmetadata|extmetadata|bitdepth|sha1&aifrom='
,'JSON_SCRAPED_CONTENT'
,'JSON_WIKI'
,'FALSE'
,'aifrom=');
--
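To give an idea of how this row is used: the scraper presumably glues the static and dynamic URL parts together and fetches the result. The following is only a minimal, standalone sketch of that idea (the 'Albert' start value and the anonymous block are illustration, not the actual scraper code from the earlier posts):
--
REM Sketch: build the request URL from the config row and fetch the JSON document.
REM Run with SET SERVEROUTPUT ON to see the first part of the response.
DECLARE
  l_json CLOB;
BEGIN
  SELECT HTTPURITYPE(cfg.url_json_api || cfg.url_parameter_static || 'Albert').getclob()
  INTO   l_json
  FROM   json_wiki.json_scraper_config cfg
  WHERE  cfg.target_table = 'JSON_SCRAPED_CONTENT';
  --
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(l_json, 250, 1));
END;
/
--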
Download Content
The whole environment, with all the demo material, can most easily be created by importing the dumpfile.
In SQL, create a user, for example the one I used (JSON_WIKI) or a new one like “JSON”. Grant the user DBA privileges (for testing/demonstration purposes/ease) and import the dumpfile using IMPDP into a database with a version equal to or newer than 12.1.0.2, for example as sketched below.
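Roughly like this (a sketch; the password, directory object and dumpfile name are placeholders, so adjust them to your own environment):
--
REM As a privileged user in the target 12.1.0.2 (or newer) database
CREATE USER json_wiki IDENTIFIED BY json_wiki;
GRANT DBA TO json_wiki;
--
REM At OS level, import the dumpfile with Data Pump (names/locations are examples)
REM   impdp system directory=DATA_PUMP_DIR dumpfile=json_wiki.dmp logfile=json_wiki_imp.log
REM   (add remap_schema=json_wiki:json if you created a user with a different name)
--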
The following (additional) content can be downloaded…
Dumpfile with all the content in schema JSON_WIKI (including sample data from Wikipedia):
Statements used to create additional “JSON” indexes and JSON text index:
Insert data in table JSON_SCRAPER_CONFIG:
Create ACL permissions to grab content outside the database (see also the sketch below):
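For reference, in 12c the ACL part boils down to something like the following. This is a sketch assuming the JSON_WIKI schema and the en.wikipedia.org host; the downloadable script is leading:
--
REM Allow the JSON_WIKI schema to resolve and connect to the Wikipedia host
BEGIN
  DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
    host => 'en.wikipedia.org',
    ace  => XS$ACE_TYPE(privilege_list => XS$NAME_LIST('connect', 'resolve'),
                        principal_name => 'JSON_WIKI',
                        principal_type => XS_ACL.PTYPE_DB));
END;
/
--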
HTH/M