Be aware (!), while following through on this, that there is a chance you might get blocked (seen as a DoS threat) by the Wikipedia sites. There is a Wikipedia API:Etiquette page that shows some of the limitations already in place, but given how I created the API calls, I won’t hit any request limit as far as I can see…
Ok, the last bits and pieces: creating the JSON “Scraper” package and its body.
Scraping the bits and pieces
As explained, the request will return an aicontinue value for the next picture in line. I also need a time setting/lag so I am not too aggressive towards the Wikipedia API. Although initially inspired by Gerald Venzl’s posts about alternative code solutions and GitHub code, I didn’t want a [CTRL-BREAK] kind of solution to stop the processing.
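Just to show what I mean, picking that aicontinue value out of the response straight from SQL could look something like the following sketch. The URL and the aifrom start value are illustrative only (the real ones live in my JSON_SCRAPER_CONFIG table), and HTTPS access from the database needs the appropriate wallet setup:
--
SELECT json_value(
         httpuritype('https://en.wikipedia.org/w/api.php?action=query'
                     ||'&list=allimages&format=json&aifrom=oracle').getClob(),
         '$.continue.aicontinue') AS next_picture
FROM dual;
--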
The package description (scraper & sleep)
--
CREATE OR REPLACE PACKAGE JSON_SCRAPER
AS
-- --------------------------------------------------------------------------
-- Version : 1.0
-- Created : Marco Gralike (MG)
-- Purpose : JSON Wikimedia Scraper demo
-- --------------------------------------------------------------------------
PROCEDURE JSON_WIKI_SCRAPER(
p_maxruntime NUMBER DEFAULT 60000 ,
p_sleep NUMBER DEFAULT 1000,
p_page_reference VARCHAR2 DEFAULT 'oracle',
p_output BOOLEAN DEFAULT FALSE);
-- --------------------------------------------------------------------------
-- Purpose:
-- Scraping JSON content from the Wikipedia / MediaWiki API
-- Maxruntime and sleep are in milliseconds. 60000 = 1 min, 1000 = 1 second.
-- --------------------------------------------------------------------------
PROCEDURE SLEEP(
x_millis IN NUMBER);
-- --------------------------------------------------------------------------
-- Purpose:
-- Alternative SLEEP implementation without the need for SYS grants
-- x_millis represents milliseconds, so a number like 15000 is equal to 15s.
-- --------------------------------------------------------------------------
END JSON_SCRAPER;
/
--
The package body has the following content:
--
CREATE OR REPLACE PACKAGE BODY JSON_SCRAPER
AS
-- ----------------------------------------------------------------
PROCEDURE JSON_WIKI_SCRAPER(
p_maxruntime NUMBER DEFAULT 60000 ,
p_sleep NUMBER DEFAULT 1000,
p_page_reference VARCHAR2 DEFAULT 'oracle',
p_output BOOLEAN DEFAULT FALSE)
AS
--
v_semaphore json_scraper_config.started%type;
v_child_name_escaped scraped_content_continue.child_name_escaped%type;
v_pkid scraped_content_continue.pkid%type;
v_count NUMBER DEFAULT 0;
v_url VARCHAR2(32767);
--
BEGIN
--
IF (p_output) THEN
dbms_output.enable (buffer_size => NULL);
END IF;
--
UPDATE json_scraper_config
SET started = 'TRUE'
WHERE target_table = 'JSON_SCRAPED_CONTENT'
AND started <> 'TRUE';
--
COMMIT;
--
LOOP
-- -----------------------------------------------------------------
-- Keep looping while column STARTED = 'TRUE'; exit on any other value
-- -----------------------------------------------------------------
SELECT started
INTO v_semaphore
FROM json_scraper_config;
--
EXIT WHEN v_semaphore <> 'TRUE';
--
v_count:=v_count+1;
--
IF (v_count*p_sleep) >= p_maxruntime THEN
EXIT;
END IF;
--
IF v_count=1 THEN
--
SELECT url_json_api
||url_parameter_static
||p_page_reference
INTO v_url
FROM JSON_SCRAPER_CONFIG;
--
INSERT INTO json_scraped_content
(json)
SELECT httpuritype(v_url).getBlob() AS "JSON" FROM dual;
--
ELSE
--
SELECT child_name_escaped,
pkid
INTO v_child_name_escaped,
v_pkid
FROM scraped_content_continue
ORDER BY pkid DESC
FETCH FIRST 1 ROWS ONLY;
--
IF (p_output) THEN
dbms_output.put_line('Escaped format['||v_pkid||']: '||v_child_name_escaped);
END IF;
--
SELECT REPLACE(url_json_api||url_parameter_static,
               url_parameter_dynamic,
               url_parameter_dynamic||v_child_name_escaped)
INTO v_url
FROM JSON_SCRAPER_CONFIG;
--
INSERT INTO json_scraped_content
(json)
SELECT httpuritype(v_url).getBlob() AS "JSON" FROM dual;
--
END IF;
--
COMMIT;
SLEEP(p_sleep);
--
IF (p_output) THEN
dbms_output.put_line('Running for '||(v_count*p_sleep/1000)||' seconds...');
dbms_output.put_line('URL: '||v_url);
END IF;
--
END LOOP;
--
END JSON_WIKI_SCRAPER;
-- ----------------------------------------------------------------
-- Alternative SLEEP method without need for SYS privileges
-- ----------------------------------------------------------------
PROCEDURE SLEEP(
x_millis IN NUMBER)
AS
-- Thread.sleep() takes a long (milliseconds), so the call spec must say "long"
LANGUAGE JAVA NAME 'java.lang.Thread.sleep(long)';
-- ----------------------------------------------------------------
END JSON_SCRAPER;
/
--
The advantage of implementing the sleep method this way is that I won’t need additional (SYS-granted) privileges on packages in the database.
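For comparison, the more common DBMS_LOCK route would first need an extra grant from SYS; roughly something like this sketch (not what I used):
--
-- As SYS, the kind of extra grant I wanted to avoid:
-- GRANT EXECUTE ON dbms_lock TO json_wiki;
--
PROCEDURE SLEEP(
  x_millis IN NUMBER)
AS
BEGIN
  -- DBMS_LOCK.SLEEP expects seconds, so convert from milliseconds
  dbms_lock.sleep(x_millis/1000);
END SLEEP;
--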
- In the first bit of the code I grab the “status” from the JSON_SCRAPER_CONFIG table to see if I, the package, am already running. If not, I update the status column in the JSON_SCRAPER_CONFIG table to “TRUE”. As long as the PL/SQL package is running, and does not exit based on the set time limit restriction, it will loop continuously until the status column has a value other than “TRUE” (see the stop snippet right after this list).
- The first time round, I grab the initial starting values (parameters) from the config table, fetch the first Wikipedia JSON content and insert it into table JSON_SCRAPED_CONTENT, and then…
- The second time around, etc., I continuously grab the content and info needed from the SCRAPED_CONTENT_CONTINUE view, based on the Top-N query that always returns the value for the last “parent” and “child” picture info,
- …and in between I created a “sleep” period (default 1 second) before the package loops and gets the next JSON content via the actions described in the previous point.
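So stopping the scraper from a second session is simply a matter of flipping that semaphore column:
--
UPDATE json_scraper_config
SET started = 'FALSE'
WHERE target_table = 'JSON_SCRAPED_CONTENT';
COMMIT;
--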
And that’s it.
More or less. It’s probably also good to notice that I use the URL_PARAMETER_DYNAMIC value in the JSON_SCRAPER_CONFIG table as the info needed to change &aifrom={parent picture} into &aifrom={child picture} (which becomes the parent value the next time around).
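To make that REPLACE construction a bit more concrete, here is a standalone sketch with made-up values; assume URL_PARAMETER_DYNAMIC contains '&aifrom=' and the current child picture is called 'Oracle_logo.svg':
--
SELECT REPLACE('https://en.wikipedia.org/w/api.php?action=query'
               ||'&list=allimages&format=json&aifrom=',  -- url_json_api||url_parameter_static
               '&aifrom=',                               -- url_parameter_dynamic
               '&aifrom='||'Oracle_logo.svg')            -- dynamic part plus child value
       AS next_url
FROM dual;
--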
And really, that’s it. 😉
Running the package now via the following (for 1 hour total, 1 second sleep in between)…
--
DECLARE
P_MAXRUNTIME NUMBER;
P_SLEEP NUMBER;
P_PAGE_REFERENCE VARCHAR2(200);
P_OUTPUT BOOLEAN;
BEGIN
P_MAXRUNTIME := 3600000;
P_SLEEP := 1000;
P_PAGE_REFERENCE := 'a';
P_OUTPUT := true;
JSON_SCRAPER.JSON_WIKI_SCRAPER(
P_MAXRUNTIME => P_MAXRUNTIME,
P_SLEEP => P_SLEEP,
P_PAGE_REFERENCE => P_PAGE_REFERENCE,
P_OUTPUT => P_OUTPUT
);
--rollback;
END;
/
--
…creates 3599 rows with JSON content in my JSON_SCRAPED_CONTENT table, which shows like this via SQL Developer directly (remember, the BLOB column?!)
…but like this via view SCRAPED_JSON_4000_MAX…
Remember the chopping of content bigger than varchar2(n) and replacing it with NULL in the JSON_TABLE columns section?
… and like this via my created view SCRAPED_CONTENT…
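By the way, if you want a quick peek at the raw BLOB content without those views, something along the following lines will do; a sketch using the standard DBMS_LOB and UTL_RAW packages (it assumes the JSON bytes are compatible with the database character set):
--
SELECT utl_raw.cast_to_varchar2(
         dbms_lob.substr(json, 2000, 1)) AS json_head
FROM json_scraped_content
FETCH FIRST 1 ROWS ONLY;
--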
Stop! Security?
One thing that is good to know, of course: before you can run the package, which via HTTPURITYPE basically uses UTL_HTTP under the covers (which uses… anyway), you must create the network ACLs via DBMS_NETWORK_ACL_ADMIN and grant access. Know that you can do this via the following code (based on the fact that the user/schema is called JSON_WIKI):
--
BEGIN
-- Drop a pre-existing ACL first; this raises an error if the ACL does not exist yet
DBMS_NETWORK_ACL_ADMIN.drop_acl ( acl => 'json_wiki.xml');
COMMIT;
END;
/
--
BEGIN
dbms_network_acl_admin.create_acl ( acl => 'json_wiki.xml'
, description => 'Allow external access'
, principal => 'JSON_WIKI'
, is_grant => TRUE
, privilege => 'connect'
, start_date => SYSTIMESTAMP
, end_date => NULL);
COMMIT;
END;
/
--
BEGIN
dbms_network_acl_admin.add_privilege ( acl => 'json_wiki.xml'
, principal => 'JSON_WIKI'
, is_grant => TRUE
, privilege => 'resolve' );
COMMIT;
END;
/
--
BEGIN
dbms_network_acl_admin.assign_acl( acl => 'json_wiki.xml',
host => '*' );
COMMIT;
END;
/
--
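Afterwards you can verify what was created via the network ACL dictionary views (DBA privileges needed; on 12.2 and later, DBA_HOST_ACLS and DBA_HOST_ACES are the newer equivalents):
--
SELECT host, lower_port, upper_port, acl
FROM dba_network_acls;
--
SELECT acl, principal, privilege, is_grant
FROM dba_network_acl_privileges
WHERE principal = 'JSON_WIKI';
--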
And that’s really it for now.
Readme?!
Here is the needed “scraper.sql”, or, if you want to have a look right now, the text-extension version:
- json_scraper.sql (text file)
- json_scraper.sql (zip file)
So don’t forget to follow the initial remarks mentioned in the file:
- create a database user JSON_WIKI
- grant dba, xdbadmin to JSON_WIKI (a sketch for these first two steps follows below)
- connect as JSON_WIKI
- run the SQL file
- execute the dbms_network_acl_admin portion mentioned
- run your JSON scraper…
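For those first two steps that boils down to something like the following (run as a privileged user; the password is a placeholder, of course):
--
CREATE USER json_wiki IDENTIFIED BY "choose_your_own";
GRANT dba, xdbadmin TO json_wiki;
--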
That should do it.