Handling large JSON files in PHP

API Discussion
Heya!

Some of my cron jobs (30 second limit) are timing out because of the time it takes to load the JSON file into memory and then parse it.

Just curious how people handle/parse the large JSON files that we can get our hands on from Blizzard.

I'm interested in trying a JSON file streamer. Has anyone had any success with this, or found a better way of handling the 30+ MB of Auction House data that we can read?

Cheers!
set_time_limit(0);             // remove the script execution time limit
ini_set("memory_limit", "1G"); // raise the memory limit to 1 GB
Thanks Nikkel, it's the cron job that times out, not the PHP run.

I'd prefer to improve and refine the run time instead of just working around the current obstacles. Especially as I have 123 concurrent crons ;) I don't think I'd last very long on my free hosting if I popped 123 GB of memory requests! :D
1. Download the file to a temp location.
2. Use a streaming parser. This is harder in PHP than you think, and you might need to roll your own; I am not aware of any existing streaming parser libs for PHP (see the sketch below).

PHP is a poor language for such things. Using Node.js, Java or C# will definitely give you many advantages.
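
Since I mentioned rolling your own, here's a minimal sketch of what a hand-rolled streaming approach could look like in PHP. It assumes the auction file layout (one big object whose "auctions" key holds an array of objects), reads the file in chunks while tracking brace depth and string/escape state, and json_decode()s one small object at a time, so the full 30+ MB never sits in memory at once. The function name and callback are made up for illustration, and note that every depth-2 object gets handed to the callback, including the few entries under "realms", so filter on the fields you need:

function streamAuctions($path, callable $callback) {
    $fh = fopen($path, 'rb');
    if (!$fh) {
        return false;
    }

    $depth = 0;        // current {} nesting depth
    $inString = false; // inside a JSON string?
    $escaped = false;  // was the previous character a backslash?
    $buf = '';         // accumulates one object at a time

    while (!feof($fh)) {
        $chunk = fread($fh, 65536);
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($depth >= 2) {
                $buf .= $c;
            }
            if ($inString) {
                // only look for the closing quote; braces inside strings don't count
                if ($escaped) {
                    $escaped = false;
                } elseif ($c === '\\') {
                    $escaped = true;
                } elseif ($c === '"') {
                    $inString = false;
                }
                continue;
            }
            if ($c === '"') {
                $inString = true;
            } elseif ($c === '{') {
                $depth++;
                if ($depth === 2) {
                    $buf = '{'; // start of one object inside a top-level array
                }
            } elseif ($c === '}') {
                $depth--;
                if ($depth === 1) {
                    // one complete object; decode just this little piece
                    $callback(json_decode($buf, true));
                    $buf = '';
                }
            }
        }
    }
    fclose($fh);
    return true;
}

Usage would look something like:

streamAuctions('auctions.json', function ($auc) {
    if (isset($auc['item']) && $auc['item'] == 82800) {
        // handle one matching auction
    }
});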
09/29/2018 11:25 AM Posted by cuddlywookie
Just curious to how people handle/parse the large json files that we can get our hands on from Blizzard.
I don't think I'd last very long on my free hosting if I popped 123gb of memory requests!

Your free hosting is probably overloaded. It takes me under half a second to parse a 25MB auction file.
$ time php -r 'ini_set("memory_limit","384M");echo filesize("auctions.json"),"\n";$z=json_decode(file_get_contents("auctions.json"), true);'
26859002

real 0m0.495s
user 0m0.339s
sys 0m0.147s
Thanks Erorus,

It looks from the code like the JSON is local & not from the URL? Do you download the files & then parse them?

I've been reading the data directly from the URL.

Using EU Draenor as an example, what are your timings?

http://auction-api-eu.worldofwarcraft.com/auction-data/e4a529d50fe9f24cff1ad0bf1c56c897/auctions.json

I'm using 000webhost for my hosting & I don't know how to get timings like that unix time command gives on that hosting! :<
10/02/2018 01:57 AM Posted by cuddlywookie
It looks from the code that the json is local & not from the url? Do you download the files & then parse them?
Well, yeah. I have one job that handles getting the files on disk, and a separate job that parses them.

What's the slow bit for you, getting the file or parsing the json? Since you were talking about a streaming json parser I thought you determined the parse was slow, but now I suspect it's the download time that's getting you.
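
Since you mentioned you can't run the unix time command on your host, here's a minimal sketch to get the same split from inside PHP ($url standing in for the auction file URL):

$t0 = microtime(true);
$json = file_get_contents($url);  // download stage
$t1 = microtime(true);
$data = json_decode($json, true); // parse stage
$t2 = microtime(true);

printf("download: %.3fs, parse: %.3fs\n", $t1 - $t0, $t2 - $t1);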

Using EU Draenor as an example, what are your timings?

http://auction-api-eu.worldofwarcraft.com/auction-data/e4a529d50fe9f24cff1ad0bf1c56c897/auctions.json
Using gzip transfer encoding, it takes under 5 seconds to download and parse that file:
$ time curl -H 'Accept-Encoding: gzip' 'http://auction-api-eu.worldofwarcraft.com/auction-data/e4a529d50fe9f24cff1ad0bf1c56c897/auctions.json' | zcat | php -r 'ini_set("memory_limit","384M"); $json=stream_get_contents(STDIN); $auc=json_decode($json, true); echo count($auc["auctions"]), " auctions read\n";'
154177 auctions read

real 0m4.295s
user 0m0.613s
sys 0m0.438s

You are using gzip transfer encoding, right? Because downloading that file straight takes me over 30 seconds:
$ time curl 'http://auction-api-eu.worldofwarcraft.com/auction-data/e4a529d50fe9f24cff1ad0bf1c56c897/auctions.json' | php -r 'ini_set("memory_limit","384M"); $json=stream_get_contents(STDIN); $auc=json_decode($json, true); echo count($auc["auctions"]), " auctions read\n";'
154177 auctions read

real 0m36.721s
user 0m0.543s
sys 0m0.755s

If you have a 30M file to download in under 30 seconds, it's gotta happen at a rate faster than 1MB/sec. Since the CDN doesn't typically go that fast, you gotta transfer less data. When using gzip transfer encoding, that 28.7MB file gets transferred using 2.5MB on the wire.
"You are using gzip transfer encoding, right? "

Er... :$

Haha, no :<

I'm sorry my original question wasn't clear. I'm a beginner coder who has taught myself HTML, then MySQL, then PHP, and I'm putting it all together to push the boundaries of my knowledge on this fan site, which I host on free hosting (until I've built the site / can afford some better hosting).

I've never even heard of gzip before now. I was reading the URL and decoding it (or not) based on a lookup of the updated date in the AH JSON file.

Then looping through that decoded JSON (that's sat in memory) to filter just the specific IDs I want.

Putting all the data into one big INSERT statement that puts it into my MySQL DB.

I'll look at downloading the files via curl (just after I learn OAuth *cries*), thank you!

If I could ask one more question:
$ time curl -H 'Accept-Encoding: gzip'
This format looks different from anything I've experienced before. Is this actually PHP or something from the unix command line? Is it something that I can put in a .php file and copy/paste?

I don't think I have any access to command prompts, just an FTP connection & the host's web portal.

Sorry if these are dumb questions, it's all very new to me & terminology is not my forte yet!

Thanks!
No worries, and sorry if I'm a little abrasive sometimes. Typing into boxes long enough does that to me. :)

The stuff I've been pasting in has all been command line examples, using curl to download files and running php on the command line to parse it, just to show timings. You don't have to load the files that way.

How are you downloading the files, just using file_get_contents($url)? That's the most common way to start, but not very flexible.
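
If you're stuck with file_get_contents for now, you can still ask for gzip through a stream context; a minimal sketch (gzdecode() needs PHP 5.4+, and $url is your auction file URL):

$ctx = stream_context_create(['http' => [
    'header' => "Accept-Encoding: gzip\r\n", // ask the server to compress
]]);
$raw = file_get_contents($url, false, $ctx);

// file_get_contents won't decompress for you, so check what came back
$gzipped = false;
foreach ($http_response_header as $h) {
    if (stripos($h, 'Content-Encoding: gzip') === 0) {
        $gzipped = true;
    }
}
$json = $gzipped ? gzdecode($raw) : $raw;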

Look into the php curl functions for downloading from URLs: https://secure.php.net/manual/en/book.curl.php

You can use something like:
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
to have curl use gzip transfer encoding.
Thanks again for the prompt reply and support/advice.

You weren't abrasive at all! :) Far from it. Sorry, my intention in my previous post was to set the scene: that I don't know all the correct terminology and methods, but will try to read/experiment/learn!

Here's the code, gosh this feels embarrassing! Like showing someone a painting that you've done, when the person you're showing it to is Bob Ross & all you've managed is a potato print, whilst accidentally eating half the potato.


$sql = array();

$apijson = 'https://eu.api.battle.net/wow/auction/data/'.$servName.'?locale=en_GB&apikey=APIKEYHIDDEN';
$apicontents = file_get_contents($apijson);
$apidata = json_decode($apicontents, true);

$lastmod = $apidata['files'][0]['lastModified'];

$url = $apidata['files'][0]['url'];
$urlcontents = file_get_contents($url);
$urldata = json_decode($urlcontents, true);
$auctions = $urldata['auctions'];

if ($datedatacollected < 1) {
    $datedatacollected = 1;
}

if ($lastmod > $datedatacollected) {
    foreach ($auctions as $val) {
        if ($val['item'] == 82800) {
            $sql[] = "(".$lastmod.",'".addslashes($val['owner'])."','".$val['petSpeciesId']."', '".addslashes($servName)."', '".$val['buyout']."','".$region."','".addslashes($val['ownerRealm'])."')";
        }
    }

    $sql1 = "INSERT INTO `ah` (datedatacollected, seller, petSpeciesId, server, buyout, region, realrealm) VALUES ".implode(',', $sql).";";
    $mysqli->query($sql1);
}

$mysqli->close();


I'll take a peek at the manual for details on cURL. Does cURL have native support in PHP or will it be an extension that my provider will have to enable? I'm on "backwater" free hosting, they don't give us much for free! :)
10/03/2018 01:05 AM Posted by cuddlywookie
Here's the code, gosh this feels embarrassing! Like showing someone a painting that you've done, when the person you're showing it to is Bob Ross & all you've managed is a potato print, whilst accidentally eating half the potato.
If you're doing it right, you're always learning, and that feeling never fully goes away. ;)
I'll take a peek at the manual for details on cURL. Does cURL have native support in PHP or will it be an extension that my provider will have to enable? I'm on "backwater" free hosting, they don't give us much for free! :)
Curl is an extension for PHP, but it's a pretty common one; I think even most free hosts will have it preinstalled.

I put this together for you to use in place of file_get_contents:
function gzipGet($url) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL => $url,            // Fetch this URL
        CURLOPT_FOLLOWLOCATION => true, // Follow redirects
        CURLOPT_MAXREDIRS => 3,         // Follow this many redirects at most
        CURLOPT_RETURNTRANSFER => true, // curl_exec returns data into a variable
        CURLOPT_SSLVERSION => 6,        // Use TLSv1.2 for SSL connections, required by battle.net
        CURLOPT_TIMEOUT => 20,          // Seconds
        CURLOPT_ENCODING => 'gzip',     // Use gzip transfer encoding, if available
    ]);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

So, $urlcontents = gzipGet($url); should fetch it with gzip transfer encoding.
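
One aside on the code you posted: building the INSERT by string concatenation with addslashes() is fragile, since addslashes isn't a real SQL escape. Here's a minimal sketch of the same filter-and-insert using a mysqli prepared statement instead, assuming the same `ah` table and the $lastmod / $servName / $region / $auctions variables from your snippet. It trades the single big INSERT for one execute per row; if you want to keep the one big INSERT for bulk speed, you could instead escape each value with $mysqli->real_escape_string():

$stmt = $mysqli->prepare(
    "INSERT INTO `ah` (datedatacollected, seller, petSpeciesId, server, buyout, region, realrealm)
     VALUES (?, ?, ?, ?, ?, ?, ?)"
);

foreach ($auctions as $val) {
    if ($val['item'] == 82800) {
        // i = integer, s = string; mysqli handles the quoting/escaping
        $stmt->bind_param(
            'isisiss',
            $lastmod,
            $val['owner'],
            $val['petSpeciesId'],
            $servName,
            $val['buyout'],
            $region,
            $val['ownerRealm']
        );
        $stmt->execute();
    }
}

$stmt->close();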
Thank you so much!

I shall tinker, deconstruct and report back in a few days!
Sorry for the delay, had some IRL things to deal with.

The idea you came up with and what you wrote worked very well.

Thank you!

Now on to understanding OAuth2!
Just be aware that not all endpoints on the new APIs are currently getting compressed. Obviously keep requesting compression, because it should be implemented at some point.

I've only tried a handful across WoW and Diablo and so far it's only the auction data being compressed, which is great because that's probably the one that needs it the most.

I spent a couple of hours trying to figure out what I was doing wrong, as I'm developing something in a new language, and was banging my head against the wall thinking: what am I doing wrong, why isn't it coming in compressed?

Didn't occur to me that I should check to see if everything was being done correctly from the sender.
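
For anyone else who hits this: one way to see what the sender is actually doing is to capture the response headers and check Content-Encoding, while still requesting gzip. A minimal sketch using PHP curl (the function name is made up for illustration):

function responseEncoding($url) {
    $contentEncoding = ''; // stays empty if the response wasn't compressed

    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING => 'gzip', // keep requesting compression
        CURLOPT_HEADERFUNCTION => function ($ch, $header) use (&$contentEncoding) {
            if (stripos($header, 'Content-Encoding:') === 0) {
                $contentEncoding = trim(substr($header, 17));
            }
            return strlen($header); // tell curl the whole header line was consumed
        },
    ]);
    curl_exec($ch);
    curl_close($ch);

    return $contentEncoding; // e.g. "gzip", or "" for an uncompressed response
}

echo responseEncoding($url); would print gzip only when the endpoint actually compressed the reply.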
