Voronoi diagrams are an essential visualization to have in your toolbox. The diagram's structure is a data-driven tessellation of a plane, and it may be colored randomly or used to encode additional information.
The set of points that generate the Voronoi diagram are called seeds or generators, for they generate the polygon shapes; in practice, the seeds are your cleaned data. Each seed generates its own polygon called a Voronoi cell, and every point in 2-dimensional (2D) space is associated with exactly one cell. A Voronoi cell, or Voronoi region, denotes all of the space in the plane that is closest to its seed. Cell boundaries shared between two regions signify the space that is equally distant, or equidistant, to the two seeds.
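The nearest-seed rule can be sketched in a few lines of Python (the two seed coordinates here are made up purely for illustration):

```python
import math

# Two illustrative seeds; in practice these would be your cleaned data
seeds = [(0.0, 0.0), (1.0, 1.0)]

def nearest_seed(p):
    # A point belongs to the Voronoi cell of its closest seed
    return min(range(len(seeds)), key=lambda k: math.dist(p, seeds[k]))

print(nearest_seed((0.2, 0.1)))  # closest to seed 0
print(nearest_seed((0.9, 0.8)))  # closest to seed 1
```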
Given a distance metric dist and a dataset D of n 2-dimensional generator coordinates, a Voronoi diagram partitions the plane into n distinct regions. For each seed k in D, a region R_{k} is defined by the Region equation.
In English, the equation reads: this region is equal to the set of points in 2D space such that the distance between any one of these points and this generator is less than the distance between the point and all other generators. To fully understand the math, be sure to map the words to every symbol in the equation. It's important to recall that a generator, D_{k}, comes from the input data, whereas the points in its region, R_{k}, are the output.
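In symbols, that sentence can be written as follows (this reconstruction uses my own notation, since the original Region equation figure is not reproduced here):

```latex
R_k = \{\, x \in \mathbb{R}^2 \mid \mathrm{dist}(x, D_k) < \mathrm{dist}(x, D_j)\ \ \forall\, j \ne k \,\}
```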
The Voronoi diagram is the concatenation of the n regions generated by the seeds. Importantly, note that if a region's generator is on the edge of the seed space and has no possible intersections, the region will extend to infinity. This extension occurs because regions are defined as all points nearest to a single seed and not the others; since 2D space has infinitely many coordinates, some regions will always be infinite.
With an idea of what Voronoi diagrams are, we can now see how to make your own in Python. While we won't cover the algorithms used to find the Voronoi polygon vertices, we will look at how to make and customize Voronoi diagrams by extending the scipy.spatial.Voronoi functionality.
Earlier, Voronoi diagrams were defined as the concatenation of regions (Region Eq.) generated by the seeds. Regions were polygonal subsets of 2D space, i.e. a collection of points with a boundary, composed of the points closest to the region's seed. In practice, Voronoi cells are usually represented through their polygon vertices (Voronoi vertices) rather than by a collection of points; this makes sense if you consider the inefficiency of directly storing every point.
To illustrate what Voronoi diagrams are and how to make them, let's start with a very simple dataset: the four corners of the unit square. Each corner will be a seed, so there will be four Voronoi cells.
Seeds are colored as blue dots at the corners of the unit square, dotted lines represent edges of an infinite polygon, and orange dots are Voronoi vertices. As for an interpretation, this diagram sorts all of 2D space into two categories: any point on the plane is either closest to one of the four corners of the unit square (i.e. within a square), or it is equally distant to at least two corners (i.e. along a line or on an orange dot). Since all of the cells are on the edge with no possible intersections, there are no finite polygons.
Before we add one more point to the square to get a finite region, let's show the Python code to generate these simple diagrams.
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt
%matplotlib inline

# Calculate Voronoi polygons
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
vor = Voronoi(square)

def simple_voronoi(vor, saveas=None, lim=None):
    # Make Voronoi diagram
    fig = voronoi_plot_2d(vor, show_points=True, show_vertices=True, s=4)
    # Configure figure
    fig.set_size_inches(5, 5)
    plt.axis("equal")
    if lim:
        plt.xlim(*lim)
        plt.ylim(*lim)
    if saveas is not None:
        plt.savefig("../pics/%s.png" % saveas)
    plt.show()

simple_voronoi(vor, saveas="square", lim=(-0.25, 1.25))
Conveniently, scipy.spatial utilizes an optimized computational geometry library to compute Voronoi tessellations, so we won't have to worry about those details. All you need to know is that the Voronoi object vor contains all of the Voronoi information (regions, ridges, vertices, etc.) for plotting diagrams, and scipy.spatial.voronoi_plot_2d plots this information into a Voronoi diagram.
The only required input to plot a diagram through simple_voronoi is a list of coordinate tuples, i.e. the seeds that generate the Voronoi cells. So, by adding a single point in the center of the unit square, we can make a finite region.
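A minimal sketch of that five-seed dataset (the name `five` is an assumption, chosen to match the variable used in the random-seed loop later on):

```python
from scipy.spatial import Voronoi

# Unit square corners plus the center point
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
five = square + [(0.5, 0.5)]
vor = Voronoi(five)

# The center cell is the unit diamond; its corners are the Voronoi vertices
print(len(vor.vertices))
```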
The legend is the same as before, except there are now solid lines: the Voronoi ridges of a finite region. In the center lies the only finite Voronoi cell, the unit diamond. Any point within this diamond is closest to the center seed (0.5, 0.5) and no other. Notice that the addition of a single generator produced three more Voronoi vertices, but the number of cells always equals the number of generators.
Looking only at the two figures so far, you may think that Voronoi diagrams are always symmetrical. But if the seed space is more complicated, its Voronoi representation will become less ordered. Let's add five random points (n_randoms) to the unit diamond's seeds and see the resulting diagram; four separate random additions are done.
import numpy as np

# `five` is the unit square plus its center point from before
five = square + [(0.5, 0.5)]
n_randoms = 5
for i in range(4):
    randos = five + np.random.rand(n_randoms, 2).tolist()
    vor = Voronoi(randos)
    simple_voronoi(vor, lim=(-1.5, 2.5), saveas="rand%s" % i)
Produces the following:
Adding random seeds produces an asymmetrical Voronoi diagram, which better simulates what you may see with real-life data. However, our current visualization procedure for irregular generators is lackluster and complicated-looking, especially for more than 10 seeds. The clearest way to improve the images is to remove any unnecessary features (e.g., dots on vertices) and to color the cells.
In the next section, we modify the Voronoi object to improve our Voronoi visualization and produce more attractive images.
In scipy.spatial's Voronoi object, a finite region is enumerated as a list of Voronoi vertices. Recall that the outermost regions of any Voronoi diagram may extend infinitely. To represent these infinite regions, the Voronoi object uses a -1 in place of a vertex index to denote an open-faced polygon.
While it's more accurate to define infinite regions as open-faced polygons, there must be a finite cutoff somewhere to be able to visualize the diagram. In our earlier diagrams, we simply limited the image frame to a square that included enough information; in some sense, this is like closing the infinite region's open face with the edge of the image. Although limiting the image itself works fine, if you want to color the Voronoi regions, the region's data structure must be finite.
Luckily, a function has already been written to finitize scipy.spatial's Voronoi polygons. The function, voronoi_finite_polygons_2d, receives the Voronoi object and adds a far-away vertex to each infinite region to close its open-faced polygon; star it on GitHub if you find it useful.
It's now time to code a method to generate our input seeds. Since we're only concerned with visual customization, a function that makes n random Voronoi polygons will suffice.
def voronoi_polygons(n=50):
    random_seeds = np.random.rand(n, 2)
    vor = Voronoi(random_seeds)
    regions, vertices = voronoi_finite_polygons_2d(vor)
    polygons = []
    for reg in regions:
        polygon = vertices[reg]
        polygons.append(polygon)
    return polygons
numpy.random.rand(n, m) produces an n by m array of uniformly-distributed random numbers (random_seeds); in our case, it will contain n 2D coordinates with values between 0 and 1. A Voronoi object is created from random_seeds and then passed to voronoi_finite_polygons_2d. The finite polygons are listified, so each element in polygons is a list of vertices.
The now-finite Voronoi cells are ready to be colored. Remember that colors can be encoded as an "rgba" vector specifying degrees of red, green, blue, and alpha. Alpha is the opacity parameter, where more alpha yields more opacity; 0 is fully transparent, 1 is fully opaque. We can create a function to randomly generate rgba vectors by simply returning 4 random numbers.
import random

def random_color(as_str=True, alpha=0.5):
    rgb = [random.randint(0, 255),
           random.randint(0, 255),
           random.randint(0, 255)]
    if as_str:
        return "rgba" + str(tuple(rgb + [alpha]))
    else:
        # Normalize & listify
        return list(np.array(rgb) / 255) + [alpha]
The as_str keyword determines the return type, since some libraries use different conventions for rgba vectors. To test and demonstrate random_color, test_random_color plots randomly-colored slits with an increasing alpha (left to right).
def test_random_color(gridsize=(5, 100), figsize=(12, 8)):
    fig, axarr = plt.subplots(*gridsize, figsize=figsize)
    for i, ax in enumerate(axarr.flatten()):
        # Remove ticks
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        # Calculate alpha as normalized column #
        alpha = (i % gridsize[1]) / (gridsize[1] - 1)
        ax.set_facecolor(random_color(as_str=False, alpha=alpha))
    plt.subplots_adjust(hspace=0, wspace=0)
    plt.savefig("../pics/test_random_color.png")
    plt.show()

test_random_color()
Produces the following:
Now that we have a list of polygons and a coloring mechanism, it's time to bring it all together on Matplotlib.
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

def plot_polygons(polygons, ax=None, alpha=0.5, linewidth=0.7, saveas=None, show=True):
    # Configure plot
    if ax is None:
        plt.figure(figsize=(5, 5))
        ax = plt.subplot(111)
    # Remove ticks
    ax.set_xticks([])
    ax.set_yticks([])
    ax.axis("equal")
    # Set limits
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    # Add polygons
    for poly in polygons:
        colored_cell = Polygon(poly,
                               linewidth=linewidth,
                               alpha=alpha,
                               facecolor=random_color(as_str=False, alpha=1),
                               edgecolor="black")
        ax.add_patch(colored_cell)
    if saveas is not None:
        plt.savefig(saveas)
    if show:
        plt.show()
    return ax
In summary, plot_polygons configures the axis to plot on, adds the colored polygons, then saves/shows the diagram. Notably, there are two alpha parameters: one in random_color, the other in Polygon. To avoid unintended results, setting random_color's alpha to 1 ensures that the parameterized Polygon alpha behaves properly.
The function is designed to easily visualize the Voronoi diagram from the input polygons. Therefore, only one line of code is required to get a good-looking plot.
plot_polygons(voronoi_polygons())
Produces the following:
While the image is exactly what we wanted, let's demonstrate the capabilities of plot_polygons like we did with test_random_color. Each plot in the grid will be a randomly-seeded, randomly-colored Voronoi diagram at a certain seed number and alpha value.
def test_plot(gridsize=(5, 5), figsize=(12, 8)):
    fig, axarr = plt.subplots(*gridsize, figsize=figsize)
    for i in range(axarr.shape[0]):
        for j in range(axarr.shape[1]):
            ax = axarr[i, j]
            # Calc. alpha
            alpha = (j % gridsize[1]) / (gridsize[1] - 1)
            if alpha == 0: alpha += 0.1  # Set lower bound
            num_seeds = 4**(i+1)
            plot_polygons(voronoi_polygons(n=num_seeds),
                          ax,
                          alpha=alpha,
                          show=False)
            ax.set_aspect("equal", "box")
            if j == 0: ax.set_ylabel("%d" % num_seeds, size="large", rotation=90)
            if i == 0: ax.set_title("%.2f" % alpha)
    fig.text(x=0.5, y=0.95, s="Alpha", size=20)
    fig.text(x=0.1, y=0.50, s="# of Seeds", size=20, rotation=90)
    plt.savefig("../pics/test_plot.png")
    plt.show()

test_plot()
Produces the following:
As the axis labels explain, the rows have an increasing number of seeds, while the columns have an increasing alpha.
After learning what Voronoi diagrams are, we learned how to create them. Then, we finitized the data structure to improve its visualization. Since the data in this tutorial was randomly generated, it might be difficult to conceive of practical applications. However, the functions outlined here were designed to be adaptable to new applications. I encourage you to build from here.
Let's say below are the contents of your file.
This is line 1
This is line 2
This is line 3
This is line 4
This is line 5
When executed without the -c option the output on terminal would be as follows:
This is line 5
This is line 4
This is line 3
This is line 2
This is line 1
(The order of the lines is reversed: fifth line, then fourth line, then third, and so on.) When you run the program with the -c option, the output would be as follows:
After reversing the order of the lines, the characters of each line are also reversed. Your homework, should you choose to accept it (of course you have to :P), is to implement this and also implement a -r option. Everything works the same, except the user passes -r instead of -c. When the -r option is used, you would not reverse each character of a line; rather, you would reverse the order of the words. So with -r the output would be as follows:
The words are swapped/reversed in each line (last word, then second-to-last, then third-to-last, and so on).
You can assume the following things:
The user will only use the file option.
The file will only contain letters, numbers and spaces.
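If you want a reference for the expected behavior, here is one possible Python sketch (the assignment itself is language-agnostic, and the function name and flag handling here are my own assumptions):

```python
def reverse_file(lines, mode=None):
    """Reverse line order; mode 'c' also reverses the characters of each
    line, mode 'r' also reverses the word order within each line."""
    out = []
    for line in reversed(lines):
        if mode == "c":
            out.append(line[::-1])                        # -c: reverse characters
        elif mode == "r":
            out.append(" ".join(reversed(line.split())))  # -r: reverse words
        else:
            out.append(line)                              # default: lines only
    return out

lines = ["This is line 1", "This is line 2", "This is line 3"]
print(reverse_file(lines))            # line 3 first, line 1 last
print(reverse_file(lines, mode="r"))  # words reversed within each line
```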
The first few paragraphs from my paper that I am working on:
How To Compute Without Variables, Logic Operators or Measurements
Work in progress
I have come up with what I believe is a new type of logic. It is mechanical in nature, but because I could maybe see it done on the quantum level, it could be far more complex than any mechanical machine has ever been, in ways not seen in physical mechanical devices.
This logic is pure connectionism, only using connections and nothing else. I look at it as a geometry of logic. My system performs the logic only using one command without numeric variables, without logic operators, and without measurements. This is neither digital nor analog logic.
This is not theory; I have built a working model using this logic that demonstrates if-then, do-while, a randomizer, a relational database, and other logic, including a rudimentary calculator that adds/subtracts/multiplies/divides. In the working model I use only one command for hooking in the input and a few commands for output, but all the logic in between is one command that does nothing but link commands together.
The logic demonstrated in this model uses the command alias, which is used in an FPS video game called Counter-Strike, a modification of a video game made by Valve called Half-Life, which is based on id Software's QuakeWorld engine. This command links various commands together, creating a new command that executes a command string, to provide a way of customizing the interface of the game. This logic requires input and output provided by the game - which, no doubt, uses Boolean logic - but the logic itself is contained to the one command alias and does not use Boolean logic.
A readable-online version of the paper (no download, unless you want a Word copy) https://app.box.com/s/4plplfbrhwr9qflosp8tir00r0pf1467
You can also find the paper here, but you have to download it to read it: https://github.com/johnvlilley/Stateless-Computer
I suggest you start with the simple version of the calculator that does just add and subtract: https://github.com/johnvlilley/Stateless-Computer/blob/master/calculator_simple.cfg
And the complex calculator has much better inline commenting: https://github.com/johnvlilley/Stateless-Computer/blob/master/calculator_complex.cfg
I am interested in taking some of the logic, perhaps the most complicated part - the grenade throwing script - and visually re-creating it in Minecraft. This part of the logic performs the permutations of a math question I came up with, which was answered by using the ancient Chinese Pascal's Triangle in a new way. It is similar to the question of how many combinations of 4 hats on 4 pegs you can have, except I had to count the pegs as part of the permutation where they did not.
Here is the question:
"You have a combination padlock with four dials on it. Each dial has the numbers 0 through 4 on them. The lock can have as many 0s as dials, and is set to 0000 by default. The lock does not allow you to use any number between 1 and 4 two or more times in the combination. The following combinations are valid: 0123 1234 0103 0010 4031. The following combinations are invalid: 0113 4014 0202 4444. How many possible combinations are there?"
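As a sanity check on the word problem, the count can be brute-forced directly (this enumeration is my own, not taken from the linked solution):

```python
from itertools import product

# Count 4-dial codes over digits 0-4 where each of 1-4 appears at most once
count = 0
for combo in product(range(5), repeat=4):
    nonzero = [d for d in combo if d != 0]
    if len(nonzero) == len(set(nonzero)):  # no repeated nonzero digit
        count += 1
print(count)  # 209
```

The same number falls out of the closed form: summing C(4,k)*P(4,k) over k nonzero positions gives 1 + 16 + 72 + 96 + 24 = 209.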
The solution to this word problem is here; notice that it is a new use for Pascal's Triangle, because it values nothing as something:
http://mathhelpforum.com/discrete-math/17147-combination-lock.html
I desperately need help writing the paper; it is obvious I have never written one before. I can't offer any money, because I plan to put this in the public domain. I will certainly give credit where credit is due, though. Please contact me at johnphantom@hotmail.com or the more reliable johnvlilley@gmail.com
NEED HELP WITH >>>>>>>>>>>>>>>>> variable lists, IPO Chart, pseudocode, Flowchart
This tutorial provides guidance on gathering data through web-scraping. However, to demonstrate the real-life issues with acquiring data, a deep dive into a specific, complicated example is needed. The problem chosen, acquiring the geographic coordinates of gas stations in a region, turns into an interesting math problem that, eventually, involves "sacred geometry".
No matter which Data Science Process model you subscribe to, actually acquiring data to work with is necessary. By far the most straightforward data source is a simple click-to-download in a standardized file format that lets you use a parsing module in your favorite language; for example, Python's pandas.DataFrame.from_csv() function parses a .csv into a DataFrame object in one line.
Unfortunately, it's not always this easy. Real-time data, like the stream of 6,000 tweets per second, can't simply be appended to an infinitely-growing file for downloading. Furthermore, what happens when the dataset is extremely large? Smaller organizations might not be able to provide several-gigabyte downloads to each user, and if someone only needs a small subset, then giving them everything would be inefficient.
In general, these problems are solved through an Application Programming Interface (API). APIs are a programmer's way of interfacing with an application or, in the context of this article, the means by which we will acquire data. Here's a great, concise resource on why APIs are needed.
One quick note: APIs typically differ from one another, so it's not necessary to learn every detail, and you are not expected to. You'll gain experience by using them.
To see how data gathering can look in practice, this article will demonstrate a hands-on approach to finding and dealing with an API I've never used before.
Before any searching, specify your intentions. For our case, the goal is to get the location (latitude and longitude) of gas stations in the United States into a Python pandas.DataFrame. Notably, it took some googling to find a free, comprehensive source that meets our requirements. It's a little old, but myGasFeed.com has the data we're looking for.
myGasFeed was an API a developer would use to get the local gas prices of an area for a website, mobile app, etc.
Anytime you acquire data programmatically, look for any rules somewhere on the website. They might require you to register for an access key, limit requests, cite them, or follow some uncommon procedure for access. As a data-seeker, you should pay attention to guidelines for practical, ethical, and potentially legal reasons. Often, the practical part is a tutorial for accessing the data.
myGasFeed's about page describes its features and technology, but also informs you that the data is freely accessible through their API. It also refers you to the API section, which has directions on how to use a generic API key to access the data.
We need to write a program that interfaces with the myGasFeed application; that is, we need to use its API. In myGasFeed's API request directions, they provide a developer API key to use outside of a production environment. There's also a template URL to request all the gas stations in a certain radius, centered at a specified latitude/longitude. The response will be in JSON format.
Our program must generate the URL for a query, make a request for the data at that URL, then parse the response into a pandas.DataFrame.
First, a function to generate the URL requires a starting location, query radius, type of fuel, sorting criterion, and the API key. New Orleans, Louisiana will be our first example, but you can use any U.S. location you'd like. Just be sure to put a negative in front of the latitude if it's given in degrees South, and the longitude if given in degrees West.
def make_url(lat, long, distance, fuel_type, sort_by, key="rfej9napna"):
    url = "http://devapi.mygasfeed.com/"
    url += "stations/radius/%s/%s/%s/%s/%s/%s.json?callback=?" % (lat, long, distance,
                                                                  fuel_type, sort_by, key)
    return url

nola = (29.9511, -90.0715)
gas_stations_url = make_url(*nola, 40, "reg", "price")
The content of gas_stations_url is the string

'http://devapi.mygasfeed.com/stations/radius/29.9511/-90.0715/40/reg/price/rfej9napna.json?callback=?'

and represents the URL to request all gas station data within 40 miles of the center of New Orleans. Note that "reg" corresponds to regular gas, "price" means the gas stations are sorted by price, and the requested radius must be less than 50 miles.
Next, we have to actually request the data using the well-named requests module. With it, we can send a "GET" request to a specified URL with the requests.get function. The myGasFeed API allows GET requests, so it knows how to send data back to our variable.
import requests

response = requests.get(gas_stations_url,
                        headers={"user-agent": "Jeffrey Lemoine, jeffmlife@gmail.com"})
Notice that the headers parameter has my name and email. This is not necessary, but it is good practice; for someone checking for malpractice, a name and email might convince them you're not a bot.
The response variable is a requests.models.Response object. The data we requested is in the text attribute. Let's check by printing the first 500 characters of the response text.

print(response.text[:500])

prints
'?({"status":{"error":"NO","code":200,"description":"none","message":"Request ok"},"geoLocation":{"country_short":null,"lat":"29.9511","lng":"-90.0715","country_long":null,"region_short":null,"region_long":null,"city_long":null,"address":null},"stations":[{"country":"United States","zip":"70458","reg_price":"N\\/A","mid_price":"N\\/A","pre_price":"N\\/A","diesel_price":"3.85","reg_date":"7 years ago","mid_date":"7 years ago","pre_date":"7 years ago","diesel_date":"7 years ago","address":"3898 Pontch'
Though messy-looking, it appears we received a valid response, since there were no error codes and the request is said to be "ok." However, the JSON response is wrapped between '?(' and ')'. This can be solved with a short function that parses responses into a valid JSON string.
import json

def parse_response(response):
    # Get text
    response = response.text
    # Clean response text:
    # the initial response is wrapped in '?(...)'
    response = response[2:][:-1]
    # Make JSON
    data = json.loads(response)["stations"]
    return data

json_data = parse_response(response)
json.loads requires the JSON standard, so trimming the response is necessary to prevent a JSONDecodeError. Since we are interested in gas station locations, only the "stations" values of our response are required. The return type of parse_response is a list of same-keyed dicts. For example, a random element in json_data is the dict
{'address': '1701 Highway 59',
'city': None,
'country': 'United States',
'diesel': '1',
'diesel_date': '7 years ago',
'diesel_price': 'N/A',
'distance': '29.2 miles',
'id': '73100',
'lat': '30.374134',
'lng': '-90.054672',
'mid_date': '7 years ago',
'mid_price': '3.49',
'pre_date': '7 years ago',
'pre_price': '3.61',
'reg_date': '7 years ago',
'reg_price': '3.29',
'region': 'Louisiana',
'station': None,
'zip': '70448'}
Everything checks out (except that the data is seven years out of date).
Thankfully, a list of dicts can be transformed into a DataFrame in one line, but some more processing is required.
from pandas import DataFrame, to_numeric
gas_stations = DataFrame(json_data)[["lat","lng"]]
gas_stations.columns = ["lat","lon"]
gas_stations["lon"] = to_numeric(gas_stations["lon"])
gas_stations["lat"] = to_numeric(gas_stations["lat"])
In that same line, only the latitude and longitude columns were kept. Then, the columns were renamed, because I don't like "lng" as a shorthand for longitude. The coordinate values were converted from strings to numbers. The final DataFrame looks like:
and has 465 rows.
Voilà! We now have the geographic coordinates of gas stations within 40 miles of the center of New Orleans. But wait: we only used a 40-mile radius and are limited to 50 miles. What if you wanted more? The next section will take our small example and scale it to any radius.
Here's our problem: we can only request locations less than 50 miles from a specified center in a single query, but we want locations more than 50 miles away. We also want to minimize queries, to be respectful to myGasFeed.com and because we programmers strive for optimal code. I figured that we should choose different centers such that there is minimal overlap between radii while still covering all space. Since we're dealing with circles, it's impossible to avoid some overlap, but we can try to minimize it.
Let's use Denver, Colorado as our new center and draw a circle with a 49-mile radius around it.
Our previous code can already find all the gas stations within that radius. Next, we will figure out a way of generating centers that increase coverage. The naïve, but simplest, approach is to expand the centers outward in a gridlike fashion. The distance between centers can be the circle radius (49 miles) times the Euclidean distance on a grid. Each of those centers is a separate query to myGasFeed, so the data will need to be consolidated.
Now, let's write the function to generate the naïve grid before we make things more complicated. The procedure is simple. First, make an n x n grid. Then, for each square in the grid (there will be n^2 of them), calculate how far it is from the origin, as well as its angle with respect to true north. With a starting coordinate (Denver) and the distance and direction of a grid point, we can calculate where on Earth that point would be.
from numpy import *
from numpy.linalg import det
from itertools import product
from geopy.distance import geodesic, Point

def angle_between(v1, v2):
    """Calculates the clockwise angle b/w v1 and v2 in degrees"""
    # dot product ~ cos(theta)
    dot_product = dot(v1, v2)
    # determinant ~ sin(theta)
    determinant = det([v1, v2])
    # tan(theta) = sin(theta)/cos(theta) ==> theta = arctan2(sin(theta), cos(theta))
    angle_radians = arctan2(determinant, dot_product)
    # Convert to degrees in [0, 360)
    return (angle_radians * (180 / pi) + 360) % 360

def expand(center, radius, n_radii):
    # Define starting point
    start = Point(*center)
    # Generate square grid of shape (n_radii x n_radii)
    rng = arange(start = -n_radii,
                 stop = n_radii + 1,
                 step = 1,
                 dtype = float64)
    grid = list(product(rng, rng))
    # Remove center square [0,0]; no calc. required
    grid.remove(grid[len(grid)//2])
    # Reference direction
    true_north = array([0, 1], dtype=float64)
    new_centers = [center]
    for square in grid:
        # Calculate clockwise angle of square center wrt.
        # true north in degrees [0, 360)
        bearing = angle_between(square, true_north)
        # Calculate distance to travel
        euclidean = lambda p1, p2: sqrt(sum((p1 - p2)**2))
        dist = radius * euclidean(square, array([0, 0]))
        # Find coord. of point *dist* miles away at *bearing* degrees.
        # Using geodesic ensures proper lat/long conversion on the ellipsoid
        lat, long, _ = geodesic(miles=dist).destination(point=start, bearing=bearing)
        new_centers.append((lat, long))
    return new_centers

denver = (39.7392, -104.9903)  # Denver, Colorado
points = expand(denver, 49, 1)
Admittedly, my first version of the expand code could only expand once, producing a center at every 45°, i.e. a 3 x 3 lattice of points. When attempting to expand to additional radii, calculating the angle with respect to true north (i.e., the bearing) had to be generalized, which is what angle_between addresses; if you're interested in how angle_between works, check out its linear algebra here.
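To sanity-check the bearing convention, here is a self-contained restatement of angle_between (with an explicit numpy import so it runs on its own) and a few spot checks against the compass directions:

```python
import numpy as np

def angle_between(v1, v2):
    """Clockwise angle of v1 measured from v2, in degrees [0, 360)."""
    dot_product = np.dot(v1, v2)           # ~ cos(theta)
    determinant = np.linalg.det([v1, v2])  # ~ sin(theta)
    angle_radians = np.arctan2(determinant, dot_product)
    return (angle_radians * (180 / np.pi) + 360) % 360

true_north = [0, 1]
print(angle_between([1, 0], true_north))   # due east  -> 90.0
print(angle_between([0, -1], true_north))  # due south -> 180.0
print(angle_between([-1, 0], true_north))  # due west  -> 270.0
```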
As for the details of expand, it's sufficient to read its comments for an understanding. Long story short, points contains a 3 x 3 naïve grid of coordinates centered at Denver, scaled by 49 miles. If we plot these coordinates with their respective circles, we can see substantial overlap.
While the naïve grid indisputably achieves more-than-full coverage of the area, it's visually apparent that the grid is too tightly spaced. Thus, there should be a number we can multiply the grid spacing by to maximally separate the circles without missing any space. But which number?
The solution for how to best space the grid points for minimally sufficient circle overlap was surprisingly simple. Notice that within the center circle of the naïve 3x3 grid image (above) there was a square. Its diagonal is equal to the diameter of the circle; therefore, it's the largest possible square that can fit inside a circle with that radius. Interestingly, when I put the squares inside the other circles, even they overlapped inside the naïve grid.
We can visualize the inner square overlap through the following color scheme: green for no overlap, yellow for two squares, and red for four squares.
What happens if we minimize the overlap between maximal squares such that all inner space is still covered? It would just be a grid of squares and, under our previous visualization, they'd all be green squares. Most importantly, it turns out that organizing circles in a grid according to their maximal inner squares is the optimal grid for minimizing circle overlap (while maintaining full coverage).
I will now informally prove this. Imagine a grid of squares. At each corner, there are four squares surrounding it making a cross. If you were to nudge the bottom left (blue) square towards the center (red arrow), additional overlapping occurs; if you were to move the square away from the center (green arrow), then you'd lose full coverage.
While it's hard to see, moving the blue square away from the center really would lose coverage. To grasp this, look at the black line between the blue and the orange circle. There is only a single point, at the center of the grid on their squares' diagonal, at which they touch this line. This is equally true for the rest of the circles. Therefore, any movement away from the center would break full coverage.
Even more, this logic applies to the whole grid, because traveling along the green arrow for one center is a red arrow for another (except at the edges).
To actually implement the optimal grid, all we need to do is find the scaling factor for our grid spacing. Recall that the naïve grid's number, the circle's radius, was too small. Also, we already know the radius of the circle. We are missing the maximal square's dimension, x, which we've shown is the scaling factor we require.
The optimal grid's scaling factor is the square root of two (~1.414) times the radius of the circle. It separates our grid space such that circles of that radius have their maximal squares lined up and their overlap minimized.
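This factor follows from the inscribed square in one line: if the maximal square has side x, its diagonal equals the circle's diameter 2r, so

```latex
x\sqrt{2} = 2r \quad\Longrightarrow\quad x = \frac{2r}{\sqrt{2}} = r\sqrt{2} \approx 1.414\, r
```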
All we have to do to our original expand function is multiply the distance to travel by the square root of two:

...
    dist = euclidean(square, array([0, 0])) * radius * sqrt(2)
...
Now, a single expansion looks like:
The expand function only generates the coordinate centers that we need to query. A new function is required to query all of the new points; this time, though, we need a way to deal with failed requests, as well as with the rate at which we make requests.

The procedure is as follows. First, generate the list of URLs to process from the points returned by expand, and put them in a list that tracks whether or not they've been processed. When that list is empty, return the responses. All that "process" means is to use our earlier function, parse_response, and store its result.
from time import sleep
import json
import requests

def query(points):
    # List of URLs that need to be processed
    to_process = [make_url(*point, 49, "reg", "price") for point in points]
    # Current URL to process
    url = to_process.pop()
    data = {}
    while url is not None:
        response = requests.get(url,
            headers={"user-agent": "Jeffrey Lemoine, jeffmlife@gmail.com"})
        # Try to parse, except when the response is bad
        try:
            data[url] = parse_response(response)
        except json.JSONDecodeError:
            # Push the failed URL to the front so it is retried later
            to_process = [url] + to_process
        finally:
            # Pop the next URL (None once everything is processed)
            url = to_process.pop() if to_process else None
            # Limit requests to one per sec.
            sleep(1)
            print("\r%d left to process" % len(to_process), end="")
    return data

json_data = query(points)
Inside the while loop, there is a try/except/finally statement that ensures proper handling of a bad request: if a request fails, the response is not our JSON data, so parse_response throws a json.JSONDecodeError when trying to parse it. When a request fails, we don't want the loop to retry that same URL over and over again, so we push the failed URL to the front of to_process. Finally (that is, after either a successful or a failed request), pop the next URL off the end of to_process, and sleep one second so that we are respectful to the server.
And, with just a little more processing...
from itertools import chain
import pandas

# Flatten the per-URL lists of stations into one list
stations_list = list(chain.from_iterable(json_data.values()))
gas_stations = pandas.DataFrame(stations_list)[["lat", "lng"]]
gas_stations.columns = ["lat", "lon"]
gas_stations["lon"] = pandas.to_numeric(gas_stations["lon"])
gas_stations["lat"] = pandas.to_numeric(gas_stations["lat"])
gas_stations = gas_stations.drop_duplicates().dropna()
gas_stations.index = range(len(gas_stations))
We are back to a DataFrame with our data in the proper numerical types. Since the circles we generated overlapped, a drop_duplicates call is necessary to get unique data points.
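As a toy illustration (the rows here are invented, not from the scraped results), two overlapping circle queries can return the same station twice, and drop_duplicates collapses the repeats:

```python
import pandas

# Hypothetical repeated rows, as returned by two overlapping queries
df = pandas.DataFrame({"lat": ["40.10", "40.10", "40.20"],
                       "lon": ["-75.00", "-75.00", "-75.10"]})
df["lat"] = pandas.to_numeric(df["lat"])
df["lon"] = pandas.to_numeric(df["lon"])
unique = df.drop_duplicates().dropna()
assert len(unique) == 2  # the duplicated station is kept only once
```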
The experience gained from this procedure is important to highlight. Not all data acquisition is as straightforward as a click-to-download, or as convoluted as the type seen in this article. Often, you are not given instructions on how to get the data you need in the format you need it, so some hacking is required to meet your needs. On the other hand, you should ask around, or email a data provider, before you spend hours trying to get an API to work the way you'd like; there might already be a way, so don't "reinvent the wheel" unless you have to (or unless you're writing a tutorial on gathering data).
It turns out that the problem we formulated, minimizing overlap between a grid of circles, is already an artistic concept (specifically in "sacred geometry"), with its own Wikipedia page. Furthermore, I should emphasize that the optimal solution we found is only for a square lattice. There are other constructions with different grids. For example, a triangular grid with 19 circles:
I encourage you to figure out the optimal configuration of circles for any underlying grid (I actually don't know it yet), and see if you can code it up.
If you like the images I generated, I used plotly's mapbox for plotting points on a map, and Lucidchart for drawing on them.
]]>Receiver Operating Characteristic (ROC) plots are useful for visualizing a predictive model's effectiveness. This tutorial explains how to code ROC plots in Python from scratch.
We're going to use the breast cancer dataset from sklearn's sample datasets. It is an accessible, binary classification dataset (malignant vs. benign) with 30 positive, real-valued features. To train a logistic regression model, the dataset is split into train-test pools, then the model is fit to the training data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
dataset = load_breast_cancer()
# Split data into train-test pools
train, test, train_labels, test_labels = train_test_split(dataset['data'],
                                                          dataset['target'],
                                                          test_size=0.33)
# Train model
logregr = LogisticRegression().fit(train, train_labels)
Recall that the standard logistic regression model predicts the probability of a positive event in a binary situation. In this case, it predicts the probability [0,1] that a patient's tumor is benign. But as you may have heard, logistic regression is considered a classification model. It turns out that it is a regression model until you apply a decision function; then it becomes a classifier. In logistic regression, the decision function is: if x > 0.5 (where x is the predicted probability that the positive event occurs), then the positive event is true; else the other (negative) event is true.
With our newly-trained logistic regression model, we can predict the probabilities of the test examples.
import math

# Rename, listify
actuals = list(test_labels)
# Predict probabilities of test data [0,1]
scores = list(logregr.predict_proba(test)[:, 1])

# Equivalently, compute the probabilities by hand
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

scores = [sigmoid((logregr.coef_ @ test_i + logregr.intercept_).item())
          for test_i in test]
While the probabilities are continuous, we can discretize predictions by applying the decision function, the standard application of logistic regression.
# Predict binary outcomes (0,1)
predictions = list(logregr.predict(test))
# Equivalently
predictions = [1 if s>0.5 else 0 for s in scores]
And measure the accuracy of those predictions.
print("Accuracy = %.3f" % (sum([p==a for p, a in zip(predictions, actuals)])/len(predictions)))
[Out]
Accuracy = 0.957
To visualize these numbers, let's plot the predicted probabilities vs. array position. The higher an example's position on the vertical axis (closer to P = 1.0), the more likely it belongs to the benign class (according to our trained model). On the other hand, the horizontal distribution has no significance, since it's just the position in the array; it only serves to separate the data points. Blue circles represent a benign example; red squares, malignant. The line at P = 0.5 represents the decision boundary of the logistic regression model.
Anything above the line is classified as benign, whereas points on and below the line are classified as malignant. Under this visualization, we can describe accuracy as the proportion of points placed inside their correct color.
However, while statistical accuracy accounts for when the model is correct, it is not nuanced enough to be the panacea of binary classification assessment. If you want to know more about the problems with accuracy, you can find that here. For now, we can review the confusion matrix and some of its properties to dig deeper into assessing our model.
When calculating the probabilities of our test data, the variable was deliberately named scores instead of probabilities, not only for brevity but also for the generality of the term 'scores'. In the case of logistic regression, we've considered the predicted probabilities as the scores, but other models may not use probability.
The confusion matrix is a 2x2 table specifying the four types of correctness or error. There are articles on confusion matrices all over, so I will simply describe the table elements in terms of our model: a true positive (TP) is a benign tumor correctly predicted as benign; a false positive (FP) is a malignant tumor incorrectly predicted as benign; a true negative (TN) is a malignant tumor correctly predicted as malignant; and a false negative (FN) is a benign tumor incorrectly predicted as malignant.
We can easily represent the confusion matrix with the standard library's collections.namedtuple:
import collections
ConfusionMatrix = collections.namedtuple('conf', ['tp','fp','tn','fn'])
To calculate the confusion matrix of a set of predictions, three items are required: the ground truth values (actuals), the predicted values (scores), and the decision boundary (threshold). In logistic regression, a threshold of 0.5 is the natural choice for distinguishing between the two classes because of its probabilistic origins.
def calc_ConfusionMatrix(actuals, scores, threshold=0.5, positive_label=1):
    tp = fp = tn = fn = 0
    bool_actuals = [act == positive_label for act in actuals]
    for truth, score in zip(bool_actuals, scores):
        if score > threshold:   # predicted positive
            if truth:           # actually positive
                tp += 1
            else:               # actually negative
                fp += 1
        else:                   # predicted negative
            if not truth:       # actually negative
                tn += 1
            else:               # actually positive
                fn += 1
    return ConfusionMatrix(tp, fp, tn, fn)
With our current data, calc_ConfusionMatrix(actuals, scores) returns:
[Out]
conf(tp=120, fp=4, tn=60, fn=4)
The four confusion matrix elements are the inputs to several statistical functions, most of which are listed/explained on Wikipedia. One of which we've already mentioned: Accuracy. Before, we directly calculated Accuracy by just checking whether predictions were equal to actuals. Instead, we can use the Confusion Matrix equation for finding Accuracy:
def ACC(conf_mtrx):
    return (conf_mtrx.tp + conf_mtrx.tn) / \
           (conf_mtrx.tp + conf_mtrx.fp + conf_mtrx.tn + conf_mtrx.fn)
This equation makes sense; it's the proportion of correct predictions (TP's and TN's) out of all the predictions.
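As a sanity check, plugging the confusion matrix reported above into ACC reproduces the accuracy we computed directly; the definitions are repeated here so the snippet stands alone:

```python
import collections

ConfusionMatrix = collections.namedtuple('conf', ['tp', 'fp', 'tn', 'fn'])

def ACC(conf_mtrx):
    return (conf_mtrx.tp + conf_mtrx.tn) / \
           (conf_mtrx.tp + conf_mtrx.fp + conf_mtrx.tn + conf_mtrx.fn)

# conf(tp=120, fp=4, tn=60, fn=4) as returned by calc_ConfusionMatrix above
conf = ConfusionMatrix(tp=120, fp=4, tn=60, fn=4)
assert round(ACC(conf), 3) == 0.957  # matches the accuracy computed earlier
```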
The functions we are interested in, however, are called the True Positive Rate (TPR) and the False Positive Rate (FPR). Mathematically, they are also functions of the confusion matrix:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

And in Python:
def FPR(conf_mtrx):
    denom = conf_mtrx.fp + conf_mtrx.tn
    return conf_mtrx.fp / denom if denom != 0 else 0

def TPR(conf_mtrx):
    denom = conf_mtrx.tp + conf_mtrx.fn
    return conf_mtrx.tp / denom if denom != 0 else 0
TPR is also called 'sensitivity' or 'recall' and corresponds to the ability to sense, or detect, a positive case. It's a more specific way of being correct than overall Accuracy since it only considers examples that are actually positive. Furthermore, TPR is the probability that the model predicts positive given that the example is actually positive. In our dataset, TPR is the probability that the model correctly predicts benign.
Note that if your model just predicts positive, no matter the input, it will have TPR = 1.0 because it correctly predicts all positive examples as being positive. Obviously, this is not a good model because it's too sensitive at detecting positives, since even negatives are predicted as positive (i.e., false positives).
FPR is also called 'fall-out' and is often defined as one minus specificity, or 1 - True Negative Rate (TNR). FPR is a more specific way of being wrong than 1 - Accuracy since it only considers examples that are actually negative. Furthermore, FPR is the probability that the model predicts positive given that the example is actually negative. In our dataset, FPR is the probability that the model incorrectly predicts benign instead of malignant.
Note that if your model just predicts positive, no matter the input, it will have FPR = 1.0 because it incorrectly predicts all negative examples as being positive, hence the name 'False Positive Rate'. Obviously, this is not a good model because it's not specific enough at distinguishing positives from negatives.
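To make the two notes above concrete, here is a small sketch (the counts are invented) showing that an always-positive classifier drives both TPR and FPR to 1.0; the helper definitions are repeated so it stands alone:

```python
import collections

ConfusionMatrix = collections.namedtuple('conf', ['tp', 'fp', 'tn', 'fn'])

def TPR(m):
    return m.tp / (m.tp + m.fn) if (m.tp + m.fn) != 0 else 0

def FPR(m):
    return m.fp / (m.fp + m.tn) if (m.fp + m.tn) != 0 else 0

# Always predicting positive: every actual positive becomes a TP,
# every actual negative becomes an FP; nothing is predicted negative.
always_positive = ConfusionMatrix(tp=10, fp=5, tn=0, fn=0)
assert TPR(always_positive) == 1.0
assert FPR(always_positive) == 1.0
```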
Sensitivity/Specificity Tradeoff
From the similarly-worded TPR and FPR sections, you may have noticed two things you want in a model: sensitivity and specificity. In other words, you want your model to be sensitive enough to correctly predict all positives, but specific enough to predict only true positives as positive. The optimal model would have TPR = 1.0 while still having FPR = 0.0 (i.e., 1.0 - specificity = 0.0). Unfortunately, it's usually the case that increasing sensitivity decreases specificity, and vice versa. Look again at the decision boundary plot near P = 0.7, where some red and blue points are approximately equally predicted as positive. If the decision boundary were moved to P = 0.7, it would include one positive example (increasing sensitivity) at the cost of including some reds (decreasing specificity).
One of the major problems with using Accuracy is its discontinuity. Consider the fact that all false positives are considered as equally incorrect, no matter how confident the model is. Clearly, some wrongs are more wrong than others (as well as some rights), but a single Accuracy score ignores this fact. Is it possible to account for continuity by factoring in the distance of predictions from the ground truth?
Another potential problem we've encountered is the selection of the decision boundary. As said before, logistic regression's threshold for what is considered as positive starts at 0.5, and is technically the optimal threshold for separating classes. However, what if you weren't using logistic regression or something in which there isn't an understood optimal threshold?
Both of the above problems can be solved by what I've named thresholding. Before, we calculated confusion matrices and their statistics at a static threshold, namely 0.5. But what if we calculated confusion matrices for all possible threshold values?
The logic is simple: make the finite domain of your scoring system ([0,1] in steps of 0.001 in our case), calculate the confusion matrix at each threshold in the domain, then compute statistics on those confusion matrices.
import numpy as np

def apply(actuals, scores, **fxns):
    # Generate thresholds over the score domain
    low = min(scores)
    high = max(scores)
    step = (abs(low) + abs(high)) / 1000
    thresholds = np.arange(low - step, high + step, step)
    # Calculate confusion matrices for all thresholds
    confusionMatrices = []
    for threshold in thresholds:
        confusionMatrices.append(calc_ConfusionMatrix(actuals, scores, threshold))
    # Apply the statistic functions to the confusion matrices
    results = {fname: list(map(fxn, confusionMatrices))
               for fname, fxn in fxns.items()}
    results["THR"] = thresholds
    return results
The most complicated aspect of the above code is populating the results dictionary. It loops through the **fxns parameter, which is composed of confusion matrix functions, then maps each function onto all of the recently-computed confusion matrices.
Now, there is no fixed threshold and we have statistics at every threshold so prediction-truth distances lie somewhere within the results dict. But how can we summarize, visualize, and interpret the huge array of numbers?
Recall that the end goal is to assess the quality of our model. We know its Accuracy at threshold = 0.5, but let's try to visualize it for all thresholds. Using our previous construction:
acc = apply(actuals, scores, ACC=ACC)
acc now holds Accuracies and thresholds and can be plotted in matplotlib easily.
The orange dot shows the Accuracy at threshold = 0.5, valued at 0.957; the blue dot is the best Accuracy, 0.973, at a threshold of 0.8. Note that 0.5 was not the best Accuracy threshold, and that these values are subject to change if the model were retrained. Furthermore, see that at the edges of the threshold domain the Accuracy tapers off: higher thresholds lower Accuracy because of increasing false negatives, whereas lower thresholds increase false positives. The problems of accuracy are still encountered, even across all thresholds. Therefore, it's time to introduce ROC plots.
ROC plots are simply TPR vs. FPR for all thresholds. In Python, we can use the same code as before:

def ROC(actuals, scores):
    return apply(actuals, scores, FPR=FPR, TPR=TPR)
Plotting TPR vs. FPR produces a very simple-looking figure known as the ROC plot:
The best scenario is TPR = 1.0 for all FPR over the threshold domain. One trick to reading this plot is to imagine the threshold increasing from right to left along the curve, maximal at the bottom left corner. This makes sense because, in general, at higher thresholds there are fewer false positives and true positives, since the criteria for being considered positive are stricter. At the other end, lower thresholds loosen the criteria for being considered positive so much that eventually everything is labeled positive (the upper right part of the curve).
The worst scenario for ROC plots is along the diagonal, which corresponds to a random classifier. If the curve dipped beneath the random line, then it's non-randomly predicting the opposite of the truth. In this case, just do the opposite of whatever the model predicts (or check your math) and you'll get better results.
The most important thing to look for is the curve's proximity to (0, 1), which means the model is balancing sensitivity and specificity well. While the curve tells you a lot of useful information, it would be nice to have a single number that captures it. Conveniently, if you take the Area Under the ROC Curve (AUC), you get a simple, interpretable number that is very often used to quickly describe a model's effectiveness.
The AUC corresponds to the probability that some positive example ranks above some negative example. You can go deep into this interpretation here. Nevertheless, the number gets straight to the point: the higher the better. It factors in specificity and sensitivity across all thresholds, so it does not suffer the same fate as Accuracy.
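One way to compute the AUC from the thresholded FPR/TPR lists is numerical integration with the trapezoidal rule. This is a sketch (the function name is mine, not from the article), checked against two curves whose areas are known:

```python
def auc_from_roc(fpr, tpr):
    # Sort points by FPR so the curve is integrated left to right
    pts = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # one trapezoid slice
    return area

# A perfect classifier rises straight to TPR = 1, then runs across: AUC = 1
assert auc_from_roc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]) == 1.0
# The random-classifier diagonal has AUC = 0.5
assert auc_from_roc([0.0, 1.0], [0.0, 1.0]) == 0.5
```

In practice you would pass in the "FPR" and "TPR" lists produced by the ROC function above; sklearn's roc_curve and auc utilities do the same job.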
For our dataset, we computed an AUC of 0.995, which is quite high. However useful, sometimes you want to get more specific than a generic number across all thresholds. What if you only care about thresholds above 0.9? Or what if a false negative has severe consequences? Those kinds of questions can be addressed elsewhere.
This tutorial was a pedagogical approach to coding confusion matrix analyses and ROC plots. There is a lot more to model assessment, like Precision-Recall Curves (which you can now easily code). For further reading, I recommend going to read sklearn's implementation of roc_curve.
]]>Analyze the running time of the following algorithm using Big-Oh notation
maxarea = 0; maxpt1 = maxpt2 = maxpt3 = 0;
for (i = 1; i <= n; i++)
    for (j = i+1; j <= n; j++)
        for (k = j+1; k <= n; k++) {
            empty = true;
            for (l = 1; l <= n; l++) {
                if ((l != i) && (l != j) && (l != k))
                    if (insideTriangle(i, j, k, l)) empty = false;
            }
            area = TriangleArea(i, j, k);
            if (empty && (area > maxarea)) {
                maxarea = area;
                maxpt1 = i; maxpt2 = j; maxpt3 = k;
            }
        }
print maxpt1, maxpt2, maxpt3
]]>Here we can solve for just the year of the birthday.
But how can we also get the month and day?
What's up??
Recently I came across the InterviewBit Academy Course. Link - InterviewBit Academy
It's a 6-month immersive online program that helps you develop your tech skills and gets you your dream job at no upfront cost.
It comes with a promise of a guaranteed salary in the range of $75K - $120K once we finish the course.
This line caught my attention:
"If you don't get a job, you don't need to pay a penny."
Plus, it seems they have a lot of positive reviews - https://www.quora.com/What-is-your-review-of-InterviewBit
I'm aware of Lambda School, which also guarantees the same.
This amazes me. I mean, this is the future of education. They are investing in their students.
Instead of taking loans or paying huge fees in advance, this is the way to go.
What is your view on this?
]]>Algorithm A1 takes n^2 days to solve a problem of size n.
Algorithm A2 takes 2^n seconds on the same problem.
Determine how large a problem instance is required before algorithm A1 becomes faster than algorithm A2.
How much time do the two algorithms take on that instance?