Exploratory Data Analysis II: Belfast Trees

For the second edition of my exploratory data analysis series, I shall be analysing data on trees in the city of Belfast. The dataset is supplied and made accessible by Open Data NI.

In this article I will be showing analysis I have done on the dataset, ranging from clustering of species to nonparametric statistical distributions. I used Wolfram Mathematica to perform all the analyses and generate all images. All the code here is in the Wolfram Language (WL), which is superb for data analysis, especially in a context such as this, where real-world is being analysed. 

By its exploratory nature, this is a lengthy article. I hope that it inspires you to be more curious about the world around you and to explore it further--to ask more questions.

Import Dataset

Having downloaded the data, we can import it into Mathematica and automatically extract row and column headers with the help of the SemanticImport function.

dataset = SemanticImport["/Users/marti/Downloads/odTrees.csv"]/. "" -> "N/A";
RandomSample[dataset, 3]
random_sample.png

Let's see the dimensions of the data so we can have an idea how much data we're dealing with.

dims = {Times @@ #, First@#, Last@#} &@Dimensions@dataset;
dimsGrid = Grid[Transpose@{{"Elements", "Rows", "Columns"}, dims}, gridOpts]

From the sample, we can see the different properties that each of the 37k + entries has. We can also see that there are errors in some of the entries, e.g. the second tree has a diameter of -17cm. We'll further investigate this later. 

Overview of data

One of the key strengths of the Wolfram Language is automation.

Creating a function that will automatically extract and visualise information on each property will aid our understanding of the dataset.

set = Reverse@Sort@dataset[GroupBy@property, {Length@#, #} &];
 (* plot a barchart of the top 10 most occuring trees *)
 chart = BarChart[Reverse@Take[Normal[First /@ set], UpTo@10], 
 BarOrigin -> Left, ChartLayout -> "Grouped", ChartStyle -> colorFunc, LabelingFunction -> (Placed[Style[#1, 8], After] &)];
 (* no. and list of sub-properties *)
 count = Row[{"No. of types: ", Length@set}];
 types = Sort@Normal@Keys@set;

 Column[{Column[{count, Short@types}], chart}, Spacings -> 2, 
Dividers -> Center, FrameStyle -> {Thin, LightGray}]
 ];

(* visualise properties interactively *)
Manipulate[extractProperties@type, {type, {"TYPEOFTREE", "DESCRIPTION", "SPECIESTYPE", "SPECIES","AGE", "TREESURROUND", "CONDITION"}}]

Sub-properties

Age vs. x

What we're interested in here, is the relationship between one property and the sub-properties of another. Say for instance, we wanted to know which age group has the highest number of distict species, and which species occur the most within that particular age group.

(* find subcategories within the age category *)
setOfAgeFunction[subCategory_String] := Query[All, Tally, subCategory]@dataset[GroupBy[#[[All, {"AGE", subCategory}]], First] &];

sortLevel3[set_] := Reverse@SortBy[Reverse@SortBy[#, Last] & /@ set, Total];
sortLevel3@setOfAgeFunction["SPECIESTYPE"]

Generalising this function to enable us to compare any two properties would be very useful.

setOfThis$That[setOfThis_String, subCategory_String] := Query[All, Tally, subCategory]@
 dataset[GroupBy[#[[All, {setOfThis, subCategory}]], First] &];

setOfAge$Descr = sortLevel3@setOfAgeFunction@"DESCRIPTION";
setOfSpecies$Age = sortLevel3@setOfThis$That["SPECIESTYPE", "AGE"];
setOfSurround$Age = sortLevel3@setOfThis$That["TREESURROUND", "AGE"];


x = SpeciesType

Now we want to see the age distribution of the five most occurring tree types in Belfast.

Block[{assocOriginal = Normal@Take[Normal[setOfAge$Surround], 5], assocNew},
 assocNew = Association[assocOriginal[[#, 1]] -> Reverse@Sort@Association[Rule @@@ assocOriginal[[#, 2]]] & /@ 
Range@Length@assocOriginal];
 BarChart[Reverse@assocNew, ChartStyle -> "StarryNightColors", ColorFunction -> Automatic, ChartLabels -> {Automatic, None}, 
ChartLayout -> "Stepped", LabelingFunction -> (Placed[Style[#1, 8], After] &), BarSpacing -> {0, 1}]]

x = Description

By doing the same for age and description, we discover that there are subgroups within the Juvenile group: Extra Heavy, Heavy and Select Standards.

{BarChart[SortBy[Association@Thread[Normal[Keys@setOfAge$Descr] -> 
 Normal[Values@setOfAge$Descr] /. {descr_String, count_Integer} :> count], Total],
BarOrigin -> Left, ChartLayout -> "Stepped", LabelingFunction -> After, ChartLegends -> None],
 setOfAge$Descr}

x = TreeSurround

We can also find out the age distribution of trees in the top five surroundings where they are planted in Belfast.

Block[{assocOriginal = Normal@Take[Normal[setOfSurround$Age], 5], assocNew},
 assocNew = Association[assocOriginal[[#, 1]] -> Reverse@Sort@Association[Rule @@@ assocOriginal[[#, 2]]] & /@ 
Range@Length@assocOriginal];
 BarChart[Reverse@assocNew, ChartStyle -> "StarryNightColors", ColorFunction -> Automatic, ChartLabels -> {Automatic, None}, 
ChartLayout -> "Stepped", LabelingFunction -> (Placed[Style[#1, 8], After] &), BarSpacing -> {0, 1}]]

We can also visualise this information to resemble a tree trunk. I have removed the labels to keep the figure neat, but with reference to the numbers in the chart above, it can easily be understood.

Looking deeper into the relationship between the ages and surroundings of the trees, we can tabulate the results so as to offer some insight into where (surrounding not location) one is most likely to find what type of tree. Following these steps:

  • Extract the categories surrounding (rows) and age (column). 
  • Sort each row and column in descending order.
  • The orange represents the maximum value in a row, and therefore, each row has one orange value.
  • Bold figures show the maximum within the column, and so, each column has a bold number too. 
  • Orange and bold values are maximum for both the corresponding surrounding and age.
We see that one is most likely to find a Juvenile tree near a tarmac, or a paved-brick-paviour, but never on a flower bed.

The Origin of Species

Note: I am only using this title as it seems rather apt for this section... 

Let's focus on the species for a bit. We know that there are 53 different types of trees, but there are many more different types of species -- 189 in total.

{allSpeciesTypes, allSpecies} = Reverse@SortBy[Tally@dataset[;; , #], Last] & /@ {"SPECIESTYPE", "SPECIES"}
allSpeciesTypes = Normal@allSpeciesTypes[[;; , 1]];
allSpecies = Normal@allSpecies[[;; , 1]];

In the image above we see the frequencies of the different species and tree types in the dataset. Having plotted a bar chart of the 10 most occurring in each group earlier, we'll now use word clouds to give a slightly different perspective. On the left we have the word cloud showing the relative 'sizes' of each item, and on the right, a short list showing the 10 most occurring items, as in the bar chart.

makeWordCloud[category_String] := Block[{tally = Reverse@SortBy[Tally@Normal@dataset[;; , category], Last]},
Grid[{{WordCloud[tally],Column[Row[#, " - "] & /@ Take[tally, 10], Alignment -> "-"]}}, 
 Dividers -> {Center, False}, FrameStyle -> Directive[Gray, Dotted], Alignment -> Center, Spacings -> {2, 1}]];
makeWordCloud[#] & /@ {"SPECIESTYPE", "SPECIES"} // Column[#, Spacings -> 2] &

By comparing both word clouds and the adjacent figures, one may speculate that the 'Tilla' species are lime, the 'Acer' ones Maple and the 'Fraxinus' ones Ash. But we don't really know how the species relate to the species types. So let's find out under what type each species falls.

(* see the different species of lime, maple and ash tress *)
Table[(Style[type, Bold] -> Column[ Normal[DeleteDuplicates@Normal@dataset[Select[(#"SPECIESTYPE" == type) &], 
 {"SPECIESTYPE", "SPECIES"}]][[;; , 2, 2]]]), {type, {"Lime", "Maple", "Ash"}}] 

We speculated rightly, it seems. We'll now organise everything into a single graph.

makeSpeciesRules[speciesType_String] := Thread[speciesType <-> Normal[DeleteDuplicates@Normal@dataset[
 Select[(#"SPECIESTYPE" == speciesType) &], {"SPECIESTYPE", "SPECIES"}]][[;; , 2, 2]]];
rootEdgeList = Thread[Framed["ROOT"] <-> allSpeciesTypes];
speciesEdgeList = Flatten[makeSpeciesRules[#] & /@ allSpeciesTypes];
allEdges = Join[rootEdgeList, speciesEdgeList];

g = Graph[allEdges, Properties -> Table[allSpeciesTypes[[i]] -> {VertexLabels -> 
 Placed[ Style[allSpeciesTypes[[i]], Orange, 10], After]}, {i, Range@Length@allSpeciesTypes}],
EdgeStyle -> Flatten[{Thread[rootEdgeList -> {Black, Thin, Opacity[.7]}, List, 1], 
 Thread[speciesEdgeList -> {GrayLevel[.5], Thickness[.0005], Dotted}, List, 1]}],
VertexSize -> .3, VertexStyle -> Directive[LightBlue, EdgeForm[Gray]], VertexLabels -> "Name", ImageSize -> 1000]

Clustering

By forming a hierarchical cluster plot, we can group all species. We see that tree types with very few species are grouped together, while the others surround them.

clustTree = Block[{tally = Reverse@SortBy[Tally@Normal@dataset[;; , "SPECIES"], Last]},
ClusteringTree[tally[[;; , 1]], ClusterDissimilarityFunction -> "Complete", GraphLayout -> "RadialDrawing", VertexSize -> 0, ImageSize -> 1000]]

We can also create a Dendogram using Dendrogram[clustTree] although this will have to be very well enlarged to be clear

Graph theory

We can attempt visualising the different subgroups of species (genuses and families).

HighlightGraph[clustTree, Map[Subgraph[clustTree, #] &, FindGraphCommunities[clustTree, Method -> "Spectral"]], 
 GraphHighlightStyle -> "DehighlightFade", ImageSize -> 1000]

Within such a rich network, one may be curious to know which species is farthest from the ROOT. Also know as the periphery of a graph, this is the most distant species from at least one other species in the graph).

(* clustering tree properties *)
propAssoc = PropertyValue[clustTree, "LeafLabels"];
propAssocReversed = Association@Thread[Values@propAssoc -> Keys@propAssoc];

propAssoc[[Key[#]]] & /@ GraphPeriphery[clustTree] // Column
Pinus (type)
Pyrus (type)
Malus (type)
Ulmus (type)
Prunus 'Umineko'
Prunus 'Kanzan'

And the centre of the graph (i.e., the vextex with the least eccentricity).

propAssoc[[First@GraphCenter@clustTree]] Tilla x europaea

And also the distances between vertices. The brighter the region, the farther the two corresponding vertices are from each other, and vice versa.

mat = GraphDistanceMatrix[clustTree];
{MatrixPlot[mat, ColorFunction -> "StarryNightColors", ImageSize -> 300], 
 Histogram[Flatten[mat], ColorFunction -> 
   ColorFunction -> Function[{x, y}, ColorData["NeonColors"][y]], ChartStyle -> "StarryNightColors", ImageSize -> 400]}

Given the linear nature of its connectivity, we'd only expect a single flow from the graph centre to any other vertex. Nontheless, this is very useful for understanding the relationship between species, with regards to classification of species. So to go from one type of tree to another, one must pass through the ROOT! Let's visualise the flow from Sorbus Spp (cherry tree) to Prunus 'Kanzan' (rowan tree).

{Graph[FindMaximumFlow[g, "Sorbus Spp", "Prunus 'Kanzan'", "EdgeList"], VertexLabels -> "Name", ImageSize -> 1000], 
  Show[FindMaximumFlow[g, "Sorbus Spp", "Prunus 'Kanzan'", "FlowGraph"], ImageSize -> 1000]} // Column

Tree dimensions

Whilst doing my analysis, I came across an issue I'd like to highlight. There are some negative tree dimensions, which cannot (physically) be. Here we see cases where any of the diameter, spread radius or height of a tree can be negative.

faultyData = Select[dataset, (#"DIAMETERinCENTIMETRES" < 0 || #"SPREADRADIUSinMETRES" < 0 || #"TREEHEIGHTinMETRES" < 0) &];
{Length@faultyData, RandomSample[faultyData[;; , {"DIAMETERinCENTIMETRES", "SPREADRADIUSinMETRES", "TREEHEIGHTinMETRES"}], 5]}

I also found 10 cases in which all dimensions where all dimensions were recorded as negative. I derived a new dataset of only positive dimensions and used it from this point onwards. You can see the min and max values of the new dataset below.

datasetPos = dataset[Select[(#"DIAMETERinCENTIMETRES" >= 0 && #"SPREADRADIUSinMETRES" >= 0 && #"TREEHEIGHTinMETRES" >= 0) &]] ;
Column[{#, datasetPos[MinMax, #]}] & /@ {"DIAMETERinCENTIMETRES", "SPREADRADIUSinMETRES", "TREEHEIGHTinMETRES"}

The maximum figures seem rather unimaginable. A maximum spread radius of 500m and a maximum height of 832m suggests something could be wrong with the dataset. We'll discuss this further later.

Distributions

Using the FindDistribution function, we can find the distribution of the numerical properties.

Grid[Block[{props = {"DIAMETERinCENTIMETRES", "SPREADRADIUSinMETRES", "TREEHEIGHTinMETRES"}},
MapThread[List, {props,FindDistribution[#, PerformanceGoal -> "Quality"] & /@
 Transpose[Values /@ Normal@dataset[;; , props]]}]], gridOpts]

As expected, we see that all three properties follow a disctrete uniform distribution. Additionally, the diameter can also be described using mixture and geometric distributions.

Diameter

Let's visualise the distribution of the diameters of trees, based on a few properties.

plotHistogram[property_String] := Block[{set, setLen, setMaxAllowed, extractColumn, plots},
set = datasetPos[GroupBy@property]; setLen = Length@set;
setMaxAllowed = If[setLen >= 10,RandomSample[set, 7], set];
extractColumn = setMaxAllowed[;; , ;; , "DIAMETERinCENTIMETRES"];
plots = If[Length@# <= 1,
"Only " <> ToString[Length@#] <> " value(s) available.",
Show[{
Histogram[#, 20, "PDF", Axes -> {True, False}],
SmoothHistogram[#, 5, PlotStyle -> {Orange, Thickness[.002]}, 
 Filling -> Axis, FillingStyle -> Directive[Opacity[0.4], White]]
}, AspectRatio -> 1/3, ImageSize -> 200]
] & /@ extractColumn;
Column[{StringJoin[{property, " (", If[setLen == 7, "7", "random sample of 7,"], " of ", ToString@setLen, ")"}],
Text@Grid[Transpose[(#@Normal@plots) & /@ {Keys, Values}], gridOpts]}, Alignment -> Center]
];

plotHistogram /@ {"AGE", "CONDITION", "SPECIESTYPE"}

Here are the histograms of dimensions, based on the different age groups.

Framed[Column[{#, Histogram[Most@datasetPos[GroupBy["AGE"], ;; , #], 
ChartLayout -> "Stacked", ChartLegends -> Automatic, PlotRange -> Automatic, ChartStyle -> "StarryNightColors", ChartBaseStyle -> {Directive[Opacity[1]],EdgeForm[{White, Thin, Opacity[.5]}]}, ImageSize -> 220]}, Alignment -> Center], FrameStyle -> {GrayLevel[.7], Thin}] & /@ {"DIAMETERinCENTIMETRES", "SPREADRADIUSinMETRES", "TREEHEIGHTinMETRES"}

To give a better perspective, here' s a 3D histogram comparing the diameter (y-axis) & height (x-axis).

plotHistogram3D[property_] := Block[{plot},
 plot = Histogram3D[datasetPos[GroupBy["CONDITION"], ;; , {"DIAMETERinCENTIMETRES", "TREEHEIGHTinMETRES"}][property], 
 ColorFunction -> Function[{height}, ColorData["StarryNightColors"][height]], 
 ChartBaseStyle -> Directive[Opacity[.9]], ImageSize -> If[property == #, 500, 200], PerformanceGoal -> "Quality", 
 AxesLabel -> If[property == #, Automatic, None]] &@ "Very Poor";
 Column[{property, plot}, Alignment -> Center]
 ];

Grid[{{plotHistogram3D@"Good", plotHistogram3D@"Fair", plotHistogram3D@"Poor"}, 
{plotHistogram3D@"Very Poor", SpanFromLeft, SpanFromLeft}, 
{plotHistogram3D@"Dying", plotHistogram3D@"Dead", plotHistogram3D@"N/A"}}, Frame -> All, FrameStyle -> {GrayLevel[.7], Thin}]

Classification

The WL has quite extensive machine learning capabilities, as we've seen with clustering. In this section I'll explain how I used machine learning to predict the species type of a tree based on its other properties. This is called supervised learning and can be done by generating a classifier which learns from the examples it is trained with.

We'll have to process the data such that it will become fit (taking in all other properties and giving species type as output) for training in the WL. It is important to note that the training process will be affected by the number of examples of each tree available. Whereas there are up to 6370 examples of Lime, there is only a single example of Rauli and Handkerchief trees.

(* extract properties to be used for prediction *)
d = datasetPos[;; , {"TREEHEIGHTinMETRES", "DIAMETERinCENTIMETRES", "SPREADRADIUSinMETRES", "TYPEOFTREE", "SPECIESTYPE", "AGE", "TREESURROUND", "CONDITION"}];
Multicolumn[Row[#, " - "] & /@ Reverse@SortBy[Tally@Flatten@Values@Normal@dataset[;; , {"SPECIESTYPE"}],Last], 3, Alignment -> Right]

I only included tree types with up to 1000 examples--so from Lime to Rowan.

(* process dataset to make it suitable for training *)
dAll = Cases[Thread[Values@Normal@d[;; , {"TREEHEIGHTinMETRES", "DIAMETERinCENTIMETRES",
 "SPREADRADIUSinMETRES", "TYPEOFTREE", "AGE", "TREESURROUND", "CONDITION"}] -> Normal@d[;; , "SPECIESTYPE"]],
 (l_ -> sp_ /; Or @@
MapThread[sp == # &, {{"Lime", "Maple", "Ash", "Oak", "Pine", "Cherry", "Alder", "Beech", "Cypress", "Hornbeam", "Pear", "Birch", "Poplar", "Rowan", "Willow", "Chestnut", "Larch", "Mixed",
 "Hawthorn", "Hazel", "Plane", "N/A", "Elm", "Yew", "Holly", "Lawson Cypress"}}]) :> (l -> sp), Infinity];
(* randomly rearrange all entries *)
SeedRandom[1]; dAll = RandomSample[dAll, Length@dAll];

(* select training, validation and test sets *)
Block[{l = Floor[0.20*Length[dAll]]},
 dTest = dAll[[;; l]];
 dVal = dAll[[l + 1 ;; 2*l]];
 dTrain = dAll[[(2*l) + 1 ;;]];
 ]

I initially tried a few different classification methods including Nearest Neighbours, Neural Networks and SVMs. I eventually selected the Random Forest method, as it gave the best results.

c = Classify[dTrain,
FeatureNames -> {"Age", "Surround", "Condition", "Height", "Diameter", "Spread"},
FeatureTypes -> <|"Age" -> "Nominal", "Surround" -> "Nominal", "Condition" -> "Nominal", "Height" -> "Numerical", "Diameter" -> "Numerical", "Spread" -> "Numerical"|>,
ValidationSet -> dVal, Method -> "RandomForest", PerformanceGoal -> "Quality"
];

This generated a classifier function:

Now we can use the classifier to figure out was species type a tree probably is, given the preset properties. We'll look at two examples:

  • Tree 1: age = Mature, surround = Woodland, condition = Good, height = 7m, diameter 140cm, spread radius = 3m
  • Tree 2: age = Young, surround = Tarmac, condition = Dying, height = 2m, diameter 15cm, spread radius = 1m
Column[With[{tree = #},
{Framed@c[tree, "Decision"], Grid[c[tree, {"TopProbabilities", 5}] /. Rule -> List, gridOpts]}
], Alignment -> Center] & /@ {{"Mature", "Woodland", "Good", 7, 140, 3}, {"Young", "Tarmac", "Dying", 2, 15, 1}}

Tree 1 is classified as an Alder while Tree 2 is classified as a Lime tree. But how can we tell if these predictions are accurate? Using the test dataset we created earlier, we'll be able to work out the accuracy of the classifier, based on how many tress it predicts right or wrong.

{cm = ClassifierMeasurements[c, dTest], Framed@cm["Accuracy"]} // Column

The accuracy of the classifer is about 0.44, which is pretty low and therefore unreliable. We can easily plot a confusion matrix plot to see its performance for each tree.

From the plot we can see that all of the trees had examples that were misclassified as Maple and Lime trees. This probably owed to the relatively large numbers of these trees in the training set. Interestingly, none of the Ash trees were correctly classified, despite the relatively large number of Ash examples in the training dataset. 

Having an equal number of examples for each tree in the training set might improve the accuracy of the classifier. Furthermore, there are other options which could be use to fine-tune the training process, which I didn't look at. So perhaps one could iteratively try out a combination of different options to obtain a better result.

Neural Networks

Following the doumentation example, I was able to train a neural network (NN) to predict the type of tree (out of Poplar, Oak, Cherry, Alder, Pine) given its surrounding, condition and height. I used 70% of the dataset for training, and 15% for testing and validation each.

(* select training, validation and test sets *)
SeedRandom[1]; dAll = RandomSample[dAll, Length@dAll];

Block[{dAll = Cases[dAll, (l_ -> sp_ /; Or @@
 MapThread[sp == # &, {{"Poplar", "Oak", "Cherry", "Alder", "Pine"}}]) :> (l[[{2, 3, 4}]] -> sp), Infinity], l},
 l = Floor[0.1*Length[dAll]];
 dTest2 = dAll[[;; l]];
 dVal2 = dAll[[l + 1 ;; 2*l]];
 dTrain2 = dAll[[(2*l) + 1 ;;]];
 ]

The following are the different types of trees we're aiming to be able to tell apart using the NN, and the all properties (except height which are discrete integers).

labels = Union[Values@dTrain2],
surroundValues = DeleteDuplicates[dTrain2[[All, 1, 1]]],
condValues = DeleteDuplicates[dTrain2[[All, 1, 2]]]

We'll encode these sets of properties as tensors, to be later used in the NN. For each property, it creates a tensor and assigns a value of 1 for that property and 0 for all others.

encSurround = NetEncoder[{"Class", surroundValues, "UnitVector"}]
encCond = NetEncoder[{"Class", condValues, "UnitVector"}]

Here a small sample of the training dataset.

trainingData = dTrain2; testData = dTest2;
convertRow[{s_, c_, h_} -> tree_] := <|"surround" -> s, "height" -> IntegerPart@h, "cond" -> c, "tree" -> tree|>;
trainingDataset = Dataset[convertRow /@ trainingData];
testDataset = Dataset[convertRow /@ testData];
valDataset = Dataset[convertRow /@ dVal2];
RandomSample[trainingDataset, 5]

The neural network can be represented in a graphical form, showing the inputs and the output.

net = NetGraph[{CatenateLayer[], DotPlusLayer[5], SoftmaxLayer[]},
{{NetPort["surround"], NetPort["height"], NetPort["cond"]} -> 1,1 -> 2 -> 3 -> NetPort["tree"]},
"surround" -> encSurround, "height" -> "Scalar",
"cond" -> encCond, "tree" -> NetDecoder[{"Class", labels}]]

The training can now be done, specifying the validation dataset. We'll also set a limit to the number of training rounds to be carried out to 400.

net = NetTrain[net, trainingDataset, MaxTrainingRounds -> 400, BatchSize -> 256, ValidationSet -> valDataset]

Using the NN, let's find out what a dying 13m-tall tree in , surrounded by Tarmac is likely to be.

net[<|"surround" -> "Tarmac", "height" -> 13, "cond" -> "Dying"|>]
net[<|"surround" -> "Tarmac", "height" -> 13, "cond" -> "Dying"|>, "Probabilities"]

Cherry it says. What if we wanted to know the likelyhood of a tree being a Pine tree? We can create a function that takes in our desired properties and returns the probabilty of such a tree being Pine.

p[surround_String, height_Integer, cond_String] := net[<|"surround" -> surround, "height" -> height, "cond" -> cond|>, {"Probabilities", "Pine"}];
{p["Concrete", 44, "Fair"], p["Woodland", 12, "Poor"]}

{0.506814, 0.215295}

Furthermore, we can find an optimum height at which, in addition to other properties, a tree is likely to be a Pine tree.

ListLinePlot[{
Table[{x, p["Grass", x, "Good"]}, {x, 0, 150}], Table[{x, p["Tarmac", x, "Good"]}, {x, 0, 150}],
Table[{x, p["Grass", x, "Poor"]}, {x, 0, 150}], Table[{x, p["Tarmac", x, "Poor"]}, {x, 0, 150}]
}, PlotLegends -> {"Grass, Good", "Woodland, Good", "Grass, Poor",
 "Woodland, Poor"}, Frame -> True, FrameLabel -> {"Height (m)", "Probability of being Pine"}]
It appears that a tree with a poor condition, especially on grassland, is most likely to be a Pine tree if it’s about 40m tall.

But how accurate are these predictions? Will the NN be any better than the classifer we generated earlier?

Column[{cm = Framed@ClassifierMeasurements[net, testDataset -> "tree", "Accuracy"],
ClassifierMeasurements[net, testDataset -> "tree", "ConfusionMatrixPlot"]}, Alignment -> Center]

The neural network is only slightly more accurate than the Random Forest classifier. Bear in mind though that the properties and trees analysed are different.

Geographics

As I mentioned earlier, the WL is great for computing with real-world knowledge. One of the ever-growing areas in which Wolfram Research has curated, organised and made such knowledge computable is geography. In this section, we'll be exploiting this set of functions not only to visualise the location every tree, but also to learn a few things about their distribution.

The dataset contains the the geographical coordinates (longitude and latitude) of every tree up to a precision of five decimal places.

(* gather the location and condition of every tree *)
d2 = dataset[;; , {"LONGITUDE", "LATITUDE", "CONDITION"}];

When visualising the trees, we can colour each one according to it's condition.

allConditions = DeleteDuplicates[Normal@dataset[;; , "CONDITION"]][[{6, 2, 1, 3, 7, 5, 4}]];
conditionColouring[condition_] := Switch[condition, "Good", Darker@Green, "Fair", Green, "Poor",
 RGBColor[0.9, 1, 0], "Very Poor", Orange, "Dying", Pink, "Dead", Darker@Red, "N/A", Blue];

(* legend of rating *)
ratingLengend = SwatchLegend[conditionColouring /@ allConditions, allConditions];

(* apply properties to each tree *)
d2Coloured = Normal[Values[d2] /. {lon_, lat_, con_} :> {conditionColouring[con], Opacity[.5], PointSize[.007],Point@GeoPosition[{lat, lon}]}];

(* apply property only for the tree condition is specified *)
d2ColouredSelect[condition_String] :=
Normal[Values[d2] /. {lon_, lat_, con_} :> If[con == condition,
{conditionColouring[condition], Opacity[.5], PointSize[.0075], Point@GeoPosition[{lat, lon}]}, Nothing]];

We can also set the range we're interested in. Below are two maps: one covering the entire city and the other covering only half a kilometer.

largeMap = Legended[GeoGraphics[d2Coloured], ratingLengend];
smallMap = GeoGraphics[d2Coloured, GeoCenter -> bel, GeoRange -> Quantity[.5, "Kilometers"]];
Row[{largeMap, smallMap}]

The map on the left is meant to give an idea of how the trees are spread around city. The one the right shows in greater detail, the positions of the trees. One can see the alignment of trees on either side of a street, and indeed, the cluster trees in a park. However, even using both maps, it is difficult to discern which part of Belfast has the highest number of trees just by looking. To find out we'll use specialised visualisations functions in the WL.

GeoHistograms

The GeoHistogram function allows one to plot a histogram one a map with complete control of bin sizes, which may be represented by hexagons, rectangles, triangles, or well-defined geographical boundaries. We begin by attempting to answer the question, which in retrospect, is rather vague:

Which part of Belfast has the greatest tree density?

I found that the answer to this question depends on the area of the "part". As the area of the bin changes, the region with the highest density also changes. To demonstrate this phenomenon, we'll plot GeoHistograms showing the relative densities of trees using bin sizes ranging from 0.5km to 5km, in increments of 0.5km.

dat = Normal[GeoPosition /@ Reverse /@ Values@d];
plotGeoHistogram[squareKm_] := Block[{rectangleSize = Quantity[squareKm, "Kilometers"]},
 Column[{Framed[rectangleSize, FrameStyle -> {GrayLevel[.7], Thin}],
GeoHistogram[dat, {"Rectangle", {rectangleSize, rectangleSize}}, ColorFunction -> (ColorData[{"StarryNightColors", "Reverse"}][#] &)]}, Alignment -> Center]
 ];

geoHistList = Table[plotGeoHistogram[i], {i, Range[.5, 5, .5]}];
Export["GeoHistograms.gif", geoHistList, "DisplayDurations" -> 2, ImageResolution -> 200]
We see that as the bin size approches 5km, the region with the highest denstity moves closer to the centre of the city.

We see that as the bin size approches 5km, the region with the highest denstity moves closer to the centre of the city.

Based on condition

We can also plot geohistograms based on properties. This can be quite insightful when one is keen on knowing where the dying trees are. Another example of its importance would be correlating the location of a plant infection to the location of affected trees.

plotGeoHistoConditions[condition_String, squareKm_: 2] := Block[{geoHist, selectCondition, rectangleSize},
selectCondition = d2ColouredSelect[condition];
rectangleSize = Quantity[squareKm, "Kilometers"];
geoHist = GeoHistogram[selectCondition[[;; , 4, 1]], {"Rectangle", {rectangleSize, rectangleSize}},
ColorFunction -> (Opacity[# - .1, Darker[conditionColouring@condition, .2]] &), ImageSize -> 150, PerformanceGoal -> "Quality"];
Column[{rectangleSize, GeoGraphics[selectCondition, GeoBackground -> GeoStyling[{"GeoImage", Rasterize@geoHist}],
 ImageSize -> 150]}, Alignment -> Center]]

listOfGeoHists = Table[Column[{cond,
 Grid[{ParallelTable[plotGeoHistoConditions[cond, s], {s, {5, 3, 1}}]}, Spacings -> {1, 0}]}, Alignment -> Center, Frame -> True, FrameStyle -> GrayLevel[.8]], 
{cond, {"Good", "Fair", "Poor", "Very Poor", "Dying", "Dead"}}] //Multicolumn[#, 2, Alignment -> Right] &

Kernel density estimation

Using the SmoothKernelDistribution function, we can smoothen the noise of the tree distributions, therby creating precise estimates of the densities of trees in all regions around the city. Following the WL documentation and using the Silverman bandwidth selection method, I plotted the estimated densities around the entire city. By altering the PDF function, the precision of the estimation can be improved.

(* function for automation *)
plotKernelDistribution[bandwith_String: "Silverman", PDFPower_Integer] :=
 Block[{latmin, latmax, longmin, longmax, \[ScriptCapitalD], dens, geoPlot},
{{latmin, latmax}, {longmin, longmax}} = GeoBounds@bel;
\[ScriptCapitalD] = SmoothKernelDistribution[dat[[All, 1]], bandwith];
dens = ContourPlot[Evaluate[PDF[\[ScriptCapitalD], {y, x}]^(1/PDFPower)], {x, longmin, longmax}, {y, latmin, latmax},
Frame -> False, PlotRange -> All, Contours -> 10,
ColorFunction -> (Directive[ColorData[{"StarryNightColors", "Reverse"}][#], Opacity[0.4]] &),
ContourStyle -> {Black, Thin}, PlotRangePadding -> None, PerformanceGoal -> "Quality"];
geoPlot = With[{p = Polygon@bel}, GeoGraphics[{{FaceForm@LightBrown, EdgeForm@Gray, p},
GeoStyling[{"GeoImage", dens}], p, Orange, Opacity[0.3], PointSize[0.0025], Point[dat]},
 GeoZoomLevel -> 12, GeoBackground -> GeoStyling["StreetMap", GeoStylingImageFunction -> (ImageAdjust[ColorConvert[#1, "Grayscale"], {.5, -.2}] &)],
 GeoScaleBar -> Placed["Metric", {Right, Top}], ImageSize -> 500]];
Column[{Framed[PDFPower, gridOpts], geoPlot}, Alignment -> Center]
];

kernelPlotList = ParallelTable[plotKernelDistribution@p, {p, Range[2, 20, 2]}];
Export["SmoothKernelDistribution.gif", kernelPlotList, "DisplayDurations" -> 2, ImageResolution -> 200]
As with the geohistograms, the more yellow the regions is, the smaller the tree density and vice versa.

As with the geohistograms, the more yellow the regions is, the smaller the tree density and vice versa.

We can now apply this to each condition, showing how trees, based on their conditions, are spread across Belfast.

(* function for automation *)
plotKernelDistribution2[condition_] := Block[{latmin, latmax, longmin, longmax, \[ScriptCapitalD], dens, geoPlot, selectCondition},
selectCondition = d2ColouredSelect[condition];
 
{{latmin, latmax}, {longmin, longmax}} = GeoBounds@bel;
\[ScriptCapitalD] = SmoothKernelDistribution[selectCondition[[;; , 4, 1, 1]], "Silverman"];
dens = ContourPlot[Evaluate[PDF[\[ScriptCapitalD], {y, x}]^(1/5)], {x, longmin, longmax}, {y, latmin, latmax},
Frame -> False, PlotRange -> All, Contours -> 5, ColorFunction -> (Directive[ColorData["StarryNightColors"][#], Opacity[0.4]] &),
ContourStyle -> {Black, Thin}, PlotRangePadding -> None, PerformanceGoal -> "Quality"];
geoPlot = With[{p = Polygon@bel}, GeoGraphics[{{FaceForm@LightBrown, EdgeForm@Gray, p},
GeoStyling[{"GeoImage", dens}], p, selectCondition}, GeoZoomLevel -> 12, GeoBackground -> GeoStyling["StreetMap", GeoStylingImageFunction -> (ImageAdjust[ColorConvert[#1, "Grayscale"], {.5, -.2}] &)],
 GeoScaleBar -> Placed["Metric", {Right, Top}], ImageSize -> 250]];
Framed[Column[{Style[condition, conditionColouring@condition], geoPlot},Alignment -> Center], gridOpts]
]

ParallelTable[plotKernelDistribution2@cond, {cond, allConditions}]

Note that the data covers the outskirts of Belfast to the North. Howsever, the geographical boundary we have defined only includes Belfast. That is why we have points outside of the contour plot, although these are a part the data used for the estimation.

3D model

Since we have the location and height of each tree, we can generate a three-dimensional model of the city's trees. The WL is packed with tools for importing, exporting, generating, interacting with and manipulating all sorts of 3D graphics. Of course our model can show the trees based on their properties. We have used the conditions of the trees in previous plots and this time we'll be looking at surroundings.

Firstly, we'll create a function that will extract the geopositions and heights of trees, given they have the specified surrounding.

d3[surround_] := Normal[Values@Select[
datasetPos[;; , {"LONGITUDE", "LATITUDE", "TREESURROUND","TREEHEIGHTinMETRES"}], #"TREESURROUND" == surround &]] 
 /. {lon_, lat_, srnd_, h_} :> {GeoPosition[{lat, lon}], Quantity[h, "Meters"]};

Next, we'll assign a colour to each tree according to its surrounding.

d3[surround_] := Normal[Values@Select[datasetPos[;; , {"LONGITUDE", "LATITUDE", "TREESURROUND", "TREEHEIGHTinMETRES"}], #"TREESURROUND" == surround &]] 
 /. {lon_, lat_, srnd_, h_} :> {GeoPosition[{lat, lon}], Quantity[h, "Meters"]};

colorLists = MapThread[List, {Flatten@Values@Normal@DeleteDuplicates@dataset[;; , {"TREESURROUND"}],
RandomChoice[ColorData["Legacy", "ColorList"], 13]}];

Thirdly, each tree is represented with a line, having a length of that tree's height and a colour of its surrounding.

colouredLines = {(*colour*)#[[2]], Opacity[1],
Line[{Append[Reverse[First@#1], 0], Append[Reverse[First@#1], -QuantityMagnitude[#2]/-10]} & @@@
d3[(*surrounding*)#[[1]]]]} & /@ colorLists

Lastly we'll extract the boundary of the city to form a polygon, upon which we'll plant the trees.

polygonOfMap = Map[Append[Reverse@#, 0] &, EntityValue[bel, "Polygon"] /. GeoPosition -> Identity, {-2}][[1, 1]];

Row[{Graphics3D[{colouredLines, Polygon[polygonOfMap, VertexTextureCoordinates -> {{0, 0}, {1, 0}, {1, 1}, {0, 1}}]}}, ImageSize -> 700, Axes -> False, BoxRatios -> {1, 1, 1}, Lighting -> "Neutral", ViewPoint -> {0, -2 Pi, Pi}, ViewAngle -> All, Boxed -> False],
SwatchLegend[First@#, Last@#, LegendLayout -> (Column[Row[#, " "] & /@ #] &)] &@Reverse@Transpose@colorLists}]

Gorgeous! There seems to be an anomaly though--a tree in a grass surrounding that is just way too tall. We found out that this was the tallest tree in the dataset earlier. We won't bother about the absurd spread radius because the model only accounts for the height of the tree.

(* just to confirm *)
Table[i -> Quantity[IntegerPart@Max[Select[datasetPos[;; , {"LONGITUDE", "LATITUDE", "TREESURROUND",
 "TREEHEIGHTinMETRES"}][[ ;; ]], #"TREESURROUND" == i &][[;; , -1]]], "Meters"],
 {i, Flatten@Values@Normal@DeleteDuplicates@dataset[;; , {"TREESURROUND"}]}] // Multicolumn

There it is, the 832-metre-tall tree! Now we can uproot this tree, hence allowing us to zoom in nicely into the model without any distortions. We'll limit at allowable height to 800m and run the same steps as before to generate a new model.

d4[surround_] := Normal[Values@Select[datasetPos[;; , {"LONGITUDE", "LATITUDE", "TREESURROUND",
"TREEHEIGHTinMETRES"}][[ ;; ]], (#"TREESURROUND" == surround && #"TREEHEIGHTinMETRES" <= 800) &]
 ] /. {lon_, lat_, srnd_, h_} :> {GeoPosition[{lat, lon}],Quantity[h, "Meters"]};

For a better prespective, we'll generate a series of frames, each having a progressively altered view point and view angle. This will then be stitched into an animation.

make3DModel[viewPointY_, viewAngle_] := Row[{Graphics3D[{colouredLines, {Polygon[polygonOfMap, VertexTextureCoordinates -> {{0, 0}, {1, 0}, {1, 1}, {0, 1}}]}}, 
 ImageSize -> 700, Axes -> False, BoxRatios -> {1, 1, 1}, Lighting -> "Neutral", ViewPoint -> {0, viewPointY*Pi, Pi}, ViewAngle -> viewAngle*Degree, Boxed -> False],
 SwatchLegend[First@#, Last@#, LegendLayout -> (Column[Row[#, " "] & /@ #] &)] &@ Reverse@Transpose@colorLists}];

modelList = ParallelTable[make3DModel[i[[1]], i[[2]]], {i, Transpose[Subdivide[# /. List -> Sequence, 50] & /@ {{0, -.4}, {20, 10}}]}];
Export["3DModel_2.gif", modelList, "DisplayDurations" -> .15, ImageResolution -> 200]

Conclusion

In this article--the second of my exploratory data analysis series--data on the trees of Belfast we analysed. Using a range of techniques from the varying fields such as graph theory and statitics, I ran an in-depth analysis gaining insight into trees planted in the city.

A few issues with the dataset were found including entries with impossible dimensions. I presume that these were errors in either data recording or entry. However I highly recommend Open Data NI as a source of vast volumes of freely accessible data (on Northern Ireland). 

The code I wrote to perform all analyses and generate all the diagrams are included. I have tested them all to ensure they're working properly, but if you come across any issues, please do not hesistate to drop a comment. You can run the code and further explore the datase using the Wolfram Open Cloud. It is entirely free!

I hope you have enjoyed reading this article and have learned a thing or two from it. Until next time, keep exploring.