- Add an option to strip out any given domain, not just Baidu.
I wanted a simple way to output bar graphs showing how many hits each of my sites gets every month. Even though there are tons of off-the-shelf solutions, I, of course, had to roll my own. I started by writing Parser.py, which reads the Apache logs for a given site and month and outputs a JSON file. This JSON file has a "header" that lists the total and unique hits for that month, followed by a normal JSON-formatted daily listing of every unique IP. The script is scheduled to run on the first of every month via cron. The resulting files are much smaller than the corresponding Apache logs, but of course they do not contain nearly as much information.
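To give a rough idea of the parsing step, here is a minimal sketch. It is not the actual Parser.py: the function name `summarize`, the regex, and the JSON keys (`total`, `unique`, `days`) are all assumptions made for illustration, and it assumes the logs use Apache's Common or Combined Log Format.

```python
import json
import re

# Captures the client IP and the day of month from the start of a
# Common/Combined Log Format line, for example:
# 1.2.3.4 - - [05/Mar/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[(\d{2})/\w{3}/\d{4}:')


def summarize(lines):
    """Reduce one month of log lines to a small summary dict.

    The layout (total, unique, days) is a hypothetical stand-in for the
    "header plus daily listing" described in the post.
    """
    total = 0
    days = {}  # day of month -> set of IPs seen that day
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        ip, day = match.groups()
        total += 1
        days.setdefault(day, set()).add(ip)

    unique_ips = set().union(*days.values())
    return {
        "total": total,
        "unique": len(unique_ips),
        "days": {day: sorted(ips) for day, ips in sorted(days.items())},
    }


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [05/Mar/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '180.76.15.7 - - [06/Mar/2015:11:00:00 +0000] "GET / HTTP/1.1" 200 512',
    ]
    print(json.dumps(summarize(sample), indent=2))
```

Storing only the set of IPs per day is what keeps the output so much smaller than the raw log, at the cost of losing paths, user agents, and timestamps.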
Next, I wrote Plotter.py to read these JSON files and output the aforementioned bar graph via matplotlib. I added an option, --nobaidu, to strip out all of the traffic coming from 180.76.15.xxx. Baidu's spider is relentless, weighing in at ~40 crawls per day. The same filter is easily extended to any other crawler or address range. You could even add an option that takes the IP range to exclude from the CLI.
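As a sketch of how that CLI option might look, the snippet below keeps --nobaidu and adds a hypothetical --exclude flag that takes a CIDR range. None of this is taken from Plotter.py: the flag name, the helper functions, and the `days` mapping (day of month to list of IPs) are assumptions, and it assumes every logged client is a literal IP address rather than a resolved hostname. The 180.76.15.xxx range from the post is written here as 180.76.15.0/24.

```python
import argparse
import ipaddress

BAIDU_NET = "180.76.15.0/24"  # the 180.76.15.xxx range mentioned in the post


def build_parser():
    parser = argparse.ArgumentParser(description="Plot monthly hits")
    parser.add_argument("--nobaidu", action="store_true",
                        help="drop traffic from " + BAIDU_NET)
    # Hypothetical flag: may be given several times, one CIDR range each.
    parser.add_argument("--exclude", action="append", default=[],
                        metavar="CIDR", help="extra IP range to drop")
    return parser


def excluded_networks(args):
    """Turn the parsed CLI arguments into a list of network objects."""
    cidrs = list(args.exclude)
    if args.nobaidu:
        cidrs.append(BAIDU_NET)
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def daily_counts(days, networks):
    """Count the IPs per day that fall outside every excluded network."""
    counts = {}
    for day, ips in days.items():
        kept = [ip for ip in ips
                if not any(ipaddress.ip_address(ip) in net for net in networks)]
        counts[day] = len(kept)
    return counts


if __name__ == "__main__":
    args = build_parser().parse_args()
    days = {"05": ["1.2.3.4", "180.76.15.7"], "06": ["180.76.15.200"]}
    print(daily_counts(days, excluded_networks(args)))
```

The resulting day-to-count mapping is the kind of data that can be handed to matplotlib's `pyplot.bar` to draw the graph. Using the `ipaddress` module rather than string prefixes means ranges that do not fall on a dot boundary, such as a /20, are handled correctly.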
Source for each of these: