Exploratory Data Analysis (EDA) of Geospatial Data

Posted on 2021-04-17 | In Python |

Words count in article: 1.4k

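EDA (exploratory data analysis) builds an understanding of a dataset's basic characteristics, the relationships among variables, and the relationships between variables and the prediction target, laying the groundwork for later feature engineering and modeling. This post uses the Smart Ocean Construction (智慧海洋建设) competition as a running example.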

1. Getting an overview of the data

1.1 Check the number of samples and the raw feature dimensions

data_train.shape

data_test.shape

data_train.columns  # list the column names


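import pandas as pd

# Raise the row limit below which info() still counts non-null values
pd.set_option('display.max_info_rows', 2699639)
# pd.options.display.max_info_rows = 2699639  # equivalent attribute form
data_train.info()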

# Show count (non-null values), std (standard deviation), the requested percentiles, and other summary statistics
data_train.describe([0.01, 0.025, 0.05, 0.5, 0.75, 0.9, 0.99])

Common Tools for Geospatial Data Analysis

Posted on 2021-04-15 | In Python |

Words count in article: 1.3k

Geospatial data analysis commonly relies on a handful of modules for analysis, feature extraction, and visualization, including shapely, geopandas, folium, kepler.gl, and geohash.

1. shapely

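shapely is a Python library for manipulating and analyzing geometric objects in Cartesian coordinates. Under the hood it relies on the GEOS and JTS topology libraries.

1.1 Point objects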

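import numpy as np
from shapely import geometry as geo
from shapely.geometry import Point

point1 = Point(1, 1)
point2 = Point(5, 5)
point3 = Point(10, 10)

# Visualize the points (rendered inline when run in a Jupyter notebook)
geo.GeometryCollection([point1, point2, point3])

# Convert a Point to a NumPy array via its coordinate sequence
print(np.array(point1.coords))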

LightGBM Parameter Tuning_1

Posted on 2021-03-26 | In Machine Learning |

Words count in article: 1.3k

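#1 A brief list of methods commonly used in day-to-day parameter tuning; the underlying principles will be covered in a later post.

1. Rule-of-thumb tuning:

Tune in two directions:

1. Improve accuracy: max_depth, num_leaves, learning_rate

2. Reduce overfitting: max_bin, min_data_in_leaf; L1 and L2 regularization; row subsampling and column subsampling

1. Use smaller values of num_leaves, max_depth, and max_bin to reduce model complexity.

2. Use min_data_in_leaf and min_sum_hessian_in_leaf; the larger these values, the more conservatively the model learns.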
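
The two tuning directions above can be sketched as two parameter dictionaries. This is only an illustration: the parameter names are LightGBM's own, but every numeric value, the `binary` objective, and the helper `leaves_fit_depth` are assumptions chosen for the example, not tuned recommendations.

```python
# Illustrative LightGBM parameter sets. All numeric values are assumed
# starting points for the sketch, not tuned recommendations.

# Direction 1: give the model more capacity to chase accuracy.
accuracy_params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 63,
    "max_depth": 8,
}

# Direction 2: restrain the model to reduce overfitting.
regularized_params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 15,                 # fewer leaves -> simpler trees
    "max_depth": 5,
    "max_bin": 127,                   # coarser feature histograms
    "min_data_in_leaf": 50,           # larger -> more conservative splits
    "min_sum_hessian_in_leaf": 1e-2,  # larger -> more conservative splits
    "lambda_l1": 0.1,                 # L1 regularization
    "lambda_l2": 0.1,                 # L2 regularization
    "bagging_fraction": 0.8,          # row subsampling
    "bagging_freq": 1,                # bagging only happens when this is > 0
    "feature_fraction": 0.8,          # column subsampling
}


def leaves_fit_depth(params):
    """Hypothetical sanity check: a tree of depth d has at most 2**d leaves."""
    return params["num_leaves"] <= 2 ** params["max_depth"]
```

Either dictionary could be passed as the `params` argument of `lightgbm.train`. The sanity check reflects a common rule of thumb: setting num_leaves above 2**max_depth has no effect, because the depth limit caps the leaf count first.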

Tsfresh: An Automated Feature Engineering Tool

Posted on 2021-03-25 | In Python |

Words count in article: 675

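One potential way to improve a model is to generate more candidate features and feed in more samples.

Tsfresh is a feature engineering tool for time series data. It can automatically compute a large number of time series features, such as the mean, maximum, and kurtosis. These feature sets can then be used to build machine learning models.

This post uses the Tianchi heartbeat signal classification competition as an example to demonstrate how to use tsfresh.

Usage example

1. Merge the train and test data

Merge the datasets so that the same feature engineering is applied to all of the data. (Note: add a label column with the value -1 to the test data to make later steps easier.)

data_test['label'] = -1
all_data = pd.concat((data_train, data_test)).reset_index(drop=True)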
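
To show why the -1 sentinel is convenient, here is a minimal sketch of the full round trip: merge, apply a shared transformation, then split back. The toy frames, the `id` and `signal` columns, and the `signal_sq` feature are made up for illustration and are not the competition's actual schema.

```python
import pandas as pd

# Toy stand-ins for the competition data (hypothetical columns and values).
data_train = pd.DataFrame(
    {"id": [0, 1, 2], "signal": [0.1, 0.4, 0.3], "label": [0, 1, 0]}
)
data_test = pd.DataFrame({"id": [3, 4], "signal": [0.2, 0.5]})

# Mark test rows with a sentinel label, then stack the two sets.
data_test["label"] = -1
all_data = pd.concat((data_train, data_test)).reset_index(drop=True)

# Shared feature engineering runs once on the combined frame.
all_data["signal_sq"] = all_data["signal"] ** 2

# Split back using the sentinel; the placeholder label is dropped from test.
train_part = all_data[all_data["label"] != -1].reset_index(drop=True)
test_part = (
    all_data[all_data["label"] == -1]
    .drop(columns="label")
    .reset_index(drop=True)
)
```

This only works if -1 is not a real class label in the training data, so that assumption should be checked before relying on it.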

Feature Selection_1

Posted on 2021-03-20 | In Machine Learning |

Words count in article: 1.1k

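Feature selection is an important step in data preprocessing: keeping only the important features speeds up model training. Features are usually selected from two angles:

1. Whether the feature is dispersed (how much it helps distinguish samples)
2. How strongly the feature correlates with the label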
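
Both criteria can be checked with plain pandas. The sketch below uses synthetic data: the column names `constant`, `noise`, and `signal` and the variance threshold `1e-8` are assumptions made for the example. Variance stands in for dispersion, and the absolute Pearson correlation stands in for relevance to the label.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="label")
X = pd.DataFrame({
    "constant": np.ones(n),                       # zero variance: carries no information
    "noise": rng.normal(size=n),                  # varies, but is unrelated to y
    "signal": y + rng.normal(scale=0.3, size=n),  # varies and tracks y
})

# Criterion 1: drop features that barely vary across samples.
variances = X.var()
keep_by_variance = variances[variances > 1e-8].index.tolist()

# Criterion 2: rank the remaining features by |correlation| with the label.
ranked_by_corr = (
    X[keep_by_variance].corrwith(y).abs().sort_values(ascending=False)
)
```

Pearson correlation only captures linear relationships, so a low score here does not prove a feature is useless.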

Model Fusion_1

Posted on 2021-03-15 | In Machine Learning |

Words count in article: 821

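Three methods commonly used to improve scores in Kaggle and Tianchi competitions:

1. Feature engineering
2. Hyperparameter tuning
3. Model fusion

The main approaches to model fusion are as follows:

Simple weighted fusion:

① Regression (or classification probabilities): arithmetic mean fusion, geometric mean fusion;
② Classification: voting
③ General: rank averaging, log fusion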
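
Four of the simple fusion schemes listed above (arithmetic mean, geometric mean, voting, and rank averaging) can be sketched with NumPy as follows. The prediction matrix is made up, and the 0.5 voting threshold is an assumption. Log fusion is not covered here.

```python
import numpy as np

# Hypothetical predicted probabilities: 3 models (rows) x 4 samples (columns).
preds = np.array([
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.3, 0.7, 0.2],
    [0.7, 0.1, 0.4, 0.6],
])

# Arithmetic mean fusion: average the models' scores per sample.
arithmetic = preds.mean(axis=0)

# Geometric mean fusion: exponentiate the mean of the logs (needs scores > 0).
geometric = np.exp(np.log(preds).mean(axis=0))

# Voting: threshold each model at 0.5, then take the majority label.
votes = (preds > 0.5).astype(int)
majority = (votes.sum(axis=0) > preds.shape[0] / 2).astype(int)

# Rank averaging: convert each model's scores to ranks (0 = lowest),
# average the ranks across models, and rescale to [0, 1].
ranks = preds.argsort(axis=1).argsort(axis=1)
rank_avg = ranks.mean(axis=0) / (preds.shape[1] - 1)
```

Rank averaging discards the scores' scale and keeps only their order, which is why it is often paired with ranking metrics such as AUC. The double `argsort` does not handle ties specially, so the toy matrix has no ties within a row.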

stacking/blending:

Build a multi-layer model in which the outputs of the base learners serve as the inputs to the next layer.

Some Common Pandas Operations_1

Posted on 2021-02-21 | In Python |

Words count in article: 517

Today's post covers a few commonly used Pandas operations.


import numpy as np
import pandas as pd
df = pd.read_csv('./economics.csv')

1.DataFrame to markdown/latex

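A DataFrame can be converted to many common formats, such as csv, excel, sql, json, html, and latex. Markdown and LaTeX are used as examples here.

# to_markdown() requires the optional tabulate package
print(df.to_markdown())
print(df.to_latex())

# Passing a path writes the output to a file instead of returning a string
df.to_markdown('table.md')
df.to_latex('table.tex')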

Some Common Pandas Operations_2

Posted on 2021-02-21 | In Python |

Words count in article: 404

Today's post continues with a few more commonly used Pandas operations.


import numpy as np
import pandas as pd
df = pd.read_csv('./economics.csv')

1. The DataFrame apply method

# axis=1 passes each row to the function, so the result has one value per row
df[['psavert', 'uempmed']].apply(lambda x: x.max() - x.min(), axis=1)
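
Because the `axis` argument of `apply` is easy to misread, the sketch below runs the same function along both axes. The toy frame borrows two column names from economics.csv, but its values are made up for illustration.

```python
import pandas as pd

# Toy frame; column names follow economics.csv, values are hypothetical.
df = pd.DataFrame({"psavert": [12.5, 11.7, 13.0], "uempmed": [4.5, 4.7, 4.6]})

# axis=0 (the default): the function receives each column,
# so the result has one value per column.
col_range = df.apply(lambda x: x.max() - x.min(), axis=0)

# axis=1: the function receives each row,
# so the result has one value per row.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
```

Here `col_range` is indexed by column name, while `row_range` shares the frame's row index.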

Some Common Pandas Operations_3

Posted on 2021-02-21 | In Python |

Words count in article: 943

Today's post covers how Pandas handles some common kinds of data.


import numpy as np
import pandas as pd
df = pd.read_csv('./economics.csv')

1. Handling missing data

df.isna()  # boolean mask marking each missing value

df.isna().mean()  # fraction of missing values in each column
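
The mean of the mask gives the missing ratio because each boolean counts as 0 or 1 when averaged. A small sketch on a made-up frame (the column names `a` and `b` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" is half missing, column "b" is complete.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1.0, 2.0, 3.0, 4.0]})

mask = df.isna()                # True where a value is missing
missing_ratio = mask.mean()     # True counts as 1, so the mean is the fraction missing
has_missing = mask.any()        # per column: is at least one value missing?
missing_count = mask.sum()      # per column: how many values are missing
```

Sorting `missing_ratio` in descending order is a quick way to see which columns need imputation or removal first.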