Feature Engineering (1)

Feature engineering transforms the raw data space into a new feature space in which a model can learn the patterns in the data more easily. Selecting and constructing features is a way of manually helping the model learn things it would otherwise struggle to learn, and thereby improving its performance.

1. Constructing Features from the Real-World Setting

1.1 Distance from each point to a reference point
df['x_dis'] = (df['x'] - 6165599).abs()
df['y_dis'] = (df['y'] - 5202660).abs()
df['base_dis_diff'] = ((df['x_dis']**2) + (df['y_dis']**2))**0.5  # Euclidean distance to the reference point
del df['x_dis'], df['y_dis']
df['base_dis_diff'].head()
1.2 Splitting time into day and night
df['day_night'] = 0  # night by default
df.loc[(df['hour'] > 5) & (df['hour'] < 20), 'day_night'] = 1  # 6:00-19:00 counts as day
df['day_night'].head()
1.3 Mapping months to quarters
df['quarter'] = 0
df.loc[df['month'].isin([1, 2, 3]), 'quarter'] = 1
df.loc[df['month'].isin([4, 5, 6]), 'quarter'] = 2
df.loc[df['month'].isin([7, 8, 9]), 'quarter'] = 3
df.loc[df['month'].isin([10, 11, 12]), 'quarter'] = 4
1.4 Similarity between feature deltas

① Count, for each ship, the number of records in each speed level.
② Divide the direction into 16 equal bins.
③ Count the records with zero speed, and compute statistics over the nonzero speeds.
④ Add the median and various quantiles of x, v, d, and y, and drop redundant statistics such as count/mean/min/max/std.
⑤ With ship as the group key, compute the differences (offsets) between adjacent records via shift.
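The steps above can be sketched on toy data as follows. The column names `ship`, `v` (speed), and `d` (direction in degrees) follow the snippets elsewhere in this post; the 22.5° bin width is simply 360/16.

```python
import pandas as pd

# toy trajectory data: ship id, speed v, direction d (degrees)
df = pd.DataFrame({
    'ship': [1, 1, 1, 2, 2],
    'v':    [0.0, 3.2, 4.1, 0.0, 0.0],
    'd':    [10, 95, 200, 350, 181],
})

# step 2: divide direction into 16 equal bins of 22.5 degrees each
df['d_bin'] = (df['d'] // 22.5).astype(int) % 16

# step 3: per ship, count zero-speed records and summarize nonzero speeds
zero_cnt = df[df['v'] == 0].groupby('ship')['v'].size().rename('v_zero_cnt')
nonzero_med = df[df['v'] > 0].groupby('ship')['v'].median().rename('v_nonzero_median')

# step 5: adjacent differences within each ship via shift
df['v_diff'] = df['v'] - df.groupby('ship')['v'].shift(1)
```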

2. Constructing Binned Features

2.1 Binned features for longitude/latitude and speed
df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop')  # bin speed into 200 quantile bins
df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique()))))  # map each bin to an integer code
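As a design note, `pd.qcut(..., labels=False)` yields integer bin codes directly, which avoids the manual `map` step. A minimal sketch on toy data:

```python
import pandas as pd

v = pd.Series([0.0, 0.5, 1.2, 3.3, 3.3, 7.8, 9.1, 12.0])

# quantile binning with integer labels in one step
v_bin = pd.qcut(v, 4, labels=False, duplicates='drop')
```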
2.2 Constructing regions from binned coordinates
# x_bins holds the precomputed bin edges; col_bins is the number of bins
traj.sort_values(by='x', inplace=True)
x_res = np.zeros((len(traj), ))
j = 0
for i in range(1, col_bins + 1):
    low, high = x_bins[i-1], x_bins[i]
    while j < len(traj):
        if (traj["x"].iloc[j] <= high) & (traj["x"].iloc[j] > low - 0.001):
            x_res[j] = i
            j += 1
        else:
            break
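To actually form a region id from both axes, `np.digitize` gives the same per-axis bin assignment without the manual loop. A minimal sketch, where the equal-width edges and the bin counts `col_bins`/`row_bins` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

traj = pd.DataFrame({'x': [0.1, 2.5, 4.9, 7.2],
                     'y': [1.0, 3.5, 6.0, 8.8]})
col_bins = row_bins = 3

# equal-width bin edges over each axis
x_edges = np.linspace(traj['x'].min(), traj['x'].max(), col_bins + 1)
y_edges = np.linspace(traj['y'].min(), traj['y'].max(), row_bins + 1)

# digitize returns 1-based bin indices; clip so the max value falls in the last bin
x_res = np.clip(np.digitize(traj['x'], x_edges), 1, col_bins)
y_res = np.clip(np.digitize(traj['y'], y_edges), 1, row_bins)

# combine the two axis bins into a single region id
traj['no_bin'] = (y_res - 1) * col_bins + x_res
```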

3. Constructing DataFrame Features

3.1 count features
def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    # number of distinct ships visiting each spatial bin
    unique_boat_count_df = traj_data_df.groupby(["no_bin"])["id"].nunique().reset_index()
    unique_boat_count_df.rename({"id": "visit_boat_count"}, axis=1, inplace=True)
    unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,
                                         on="no_bin", how="left")
    return unique_boat_count_df_save
3.2 shift offsets
# g is a per-ship groupby object, e.g. g = df.groupby('ship')
for f in ['x', 'y']:
    # shift the x, y coordinates in time by 1 and -1
    df[f + '_prev_diff'] = df[f] - g[f].shift(1)
    df[f + '_next_diff'] = df[f] - g[f].shift(-1)
    df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)
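Here `g` is assumed to be a per-ship `groupby` object. A self-contained version on toy data:

```python
import pandas as pd

df = pd.DataFrame({'ship': [1, 1, 1, 2, 2],
                   'x': [0.0, 1.0, 3.0, 10.0, 10.5]})
g = df.groupby('ship')

# difference to the previous and next point within each ship;
# the first/last point of each ship gets NaN
df['x_prev_diff'] = df['x'] - g['x'].shift(1)
df['x_next_diff'] = df['x'] - g['x'].shift(-1)
```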
3.3 Statistical features
def group_feature(df, key, target, aggs, flag):
    # aggregate, then rename the columns; agg with a {new_name: func}
    # dict was removed in pandas 1.0, so rename via the column index instead
    t = df.groupby(key)[target].agg(aggs).reset_index()
    t.columns = [key] + ['{}_{}_{}'.format(target, ag, flag) for ag in aggs]
    return t

t = group_feature(df, 'ship', 'x', ['max', 'min', 'mean', 'median', 'std', 'skew'], flag)
train = pd.merge(train, t, on='ship', how='left')
t = group_feature(df, 'ship', 'y', ['max', 'min', 'mean', 'median', 'std', 'skew'], flag)
train = pd.merge(train, t, on='ship', how='left')
t = group_feature(df, 'ship', 'base_dis_diff', ['max', 'min', 'mean', 'std', 'skew'], flag)
train = pd.merge(train, t, on='ship', how='left')

4. Constructing Embedding Features

A word embedding maps words into another vector space in which words of a similar type end up closer together after projection and tend to cluster.

4.1 Building word vectors with Word2vec
for i in tqdm(range(num_runs)):
    # gensim >= 4 renamed size/iter to vector_size/epochs and moved
    # vector lookup to model.wv
    model = Word2Vec(sentences, vector_size=embedding_size,
                     min_count=min_count,
                     workers=mp.cpu_count(),
                     window=window_size,
                     seed=seed, epochs=iters, sg=0)

    # average the word vectors of each sentence into one sentence embedding
    embedding_vec = []
    for ind, seq in enumerate(sentences):
        seq_vec, word_count = 0, 0
        for word in seq:
            if word not in model.wv:
                continue
            seq_vec += model.wv[word]
            word_count += 1
        if word_count == 0:
            embedding_vec.append(embedding_size * [0])
        else:
            embedding_vec.append(seq_vec / word_count)
4.2 Extracting topic distributions with NMF

TF-IDF is a weighting technique that scores a term's importance by how often it appears in a document, discounted by how common the term is across the corpus.

# vectorize the text with TF-IDF
tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n, tf_n))
tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)

# use NMF to extract the topic distribution of each document
text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)
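`self.data` and `tf_n` belong to a class not shown here; a self-contained equivalent on toy documents, with the document list and component count chosen purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["cat sat mat", "cat mat cat", "dog ran park", "dog park dog"]

# TF-IDF on unigrams
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# NMF factorizes the TF-IDF matrix into 2 nonnegative topic components;
# each row of `topics` is one document's topic distribution
topics = NMF(n_components=2, init='nndsvda', max_iter=500).fit_transform(tfidf)
```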