Log File Analysis

Your goal is to parse a log file and do some analysis on it. The log file contains all requests to a server within a specific timeframe.

You are given the following method/url definitions:

GET /api/users/{user_id}/count_pending_messages
GET /api/users/{user_id}/get_messages
GET /api/users/{user_id}/get_friends_progress
GET /api/users/{user_id}/get_friends_score
POST /api/users/{user_id}
GET /api/users/{user_id}

Where user_id is the id of the user calling the backend.
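
Because every logged path contains a concrete numeric id, the analysis has to fold each path back into one of the templates above. A minimal normalization sketch (the helper name is illustrative, and the regex assumes user ids are purely numeric):

import re

def to_template(path):
    # Collapse the numeric user id into the {user_id} placeholder
    return re.sub(r'/users/\d+', '/users/{user_id}', path)

print(to_template('/api/users/1538823671/count_pending_messages'))
# -> /api/users/{user_id}/count_pending_messages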

The script/program should output a small analysis of the sample. It should contain the following information for each of the URLs above:

  • The number of times the URL was called
  • Mean (average) response time (connect time + service time)
  • Median response time (connect time + service time)

The output should be a JSON string.

The log format is defined as:

{timestamp} {source}[{process}]: at={log_level} method={http_method} path={http_path} host={http_host} fwd={client_ip} dyno={responding_dyno} connect={connection_time}ms service={processing_time}ms status={http_status} bytes={bytes_sent}

Example:

2014-01-09T06:16:53.916977+00:00 heroku[router]: at=info method=GET path=/api/users/1538823671/count_pending_messages host=mygame.heroku.com fwd="208.54.86.162" dyno=web.11 connect=7ms service=9ms status=200 bytes=33
2014-01-09T06:18:53.014475+00:00 heroku[router]: at=info method=GET path=/api/users/78475839/count_pending_messages host=mygame.heroku.com fwd="208.54.86.162" dyno=web.10 connect=8ms service=10ms status=200 bytes=33
2014-01-09T06:20:33.142889+00:00 heroku[router]: at=info method=GET path=/api/users/847383/count_pending_messages host=mygame.heroku.com fwd="208.54.86.162" dyno=web.10 connect=7ms service=10ms status=200 bytes=33

Given the above three log lines, we would expect output like:

{
  "request_identifier": "GET /api/users/{user_id}/count_pending_messages",
  "called": 3,
  "response_time_mean": 17.0,
  "response_time_median": 17.0
}
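
For reference, the connect + service sums for the three sample requests are 16 ms, 18 ms and 17 ms, so both the mean and the median come out to 17.0. Python's statistics module reproduces this directly:

import statistics

# connect + service times (ms) from the three sample log lines
times = [7.0 + 9, 8.0 + 10, 7.0 + 10]

print(statistics.mean(times))    # 17.0
print(statistics.median(times))  # 17.0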

Here is one approach to parse the log and generate the analysis:

import json
import re
import statistics

# Parse logs and aggregate stats by URL template.
# `logs` is an iterable of raw log lines (e.g. a list of strings or an open file).
stats = {}
for line in logs:
    request = re.search(r'method=(\w+) path=([^ ]+)', line)
    timing = re.search(r'connect=(\d+)ms service=(\d+)ms', line)
    if not (request and timing):
        continue

    method, path = request.groups()

    # Total response time = connect time + service time (in ms)
    total_time = float(timing.group(1)) + float(timing.group(2))

    # Collapse the numeric user id so calls aggregate under the URL template
    template = re.sub(r'/users/\d+', '/users/{user_id}', path)
    identifier = f'{method} {template}'

    if identifier not in stats:
        stats[identifier] = {'called': 0, 'times': []}

    stats[identifier]['called'] += 1
    stats[identifier]['times'].append(total_time)

# Generate output
output = []
for identifier, data in stats.items():
    output.append({
        'request_identifier': identifier,
        'called': data['called'],
        'response_time_mean': statistics.mean(data['times']),
        'response_time_median': statistics.median(data['times'])
    })

print(json.dumps(output))
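
The snippet assumes `logs` is already bound to an iterable of raw log lines; one simple way to provide it, run before the parsing loop, is to read them from a file (sample.log is a placeholder name):

# Placeholder file name; any iterable of log lines works
with open('sample.log') as f:
    logs = f.readlines()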

The key steps are:

  • Parse logs to extract URL and times
  • Aggregate stats by URL
  • Calculate mean and median time for each
  • Format as required output

This parses the raw logs, generates per-URL statistics, and formats the final output JSON.
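
Applied to the three sample lines above, the script prints a one-element JSON array along the lines of:

[{"request_identifier": "GET /api/users/{user_id}/count_pending_messages", "called": 3, "response_time_mean": 17.0, "response_time_median": 17.0}]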