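Feature engineering transforms the raw data space into a new feature space in which a model can learn the patterns in the data more easily. Selecting and constructing features is a way of manually helping the model learn things it would otherwise struggle to learn, which improves its performance.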

1. Constructing features from real-world knowledge

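1.1 Distance from each point to a reference point

    # Absolute offsets from the reference point (6165599, 5202660)
    df['x_dis'] = (df['x'] - 6165599).abs()
    df['y_dis'] = (df['y'] - 5202660).abs()
    # Euclidean distance to the reference point
    df['base_dis_diff'] = ((df['x_dis']**2) + (df['y_dis']**2))**0.5
    # Drop the intermediate columns
    del df['x_dis'], df['y_dis']
    df['base_dis_diff'].head()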
1.2 Splitting time into day and night

    # Default to night (0); hours 6 through 19 count as day (1)
    df['day_night'] = 0
    df.loc[(df['hour'] > 5) & (df['hour'] < 20), 'day_night'] = 1
    df['day_night'].head()
1.3 Grouping months into quarters

    df['quarter'] = 0
    df.loc[(df['month'].isin([1, 2, 3])), 'quarter'] = 1
    df.loc[(df['month'].isin([4, 5, 6])), 'quarter'] = 2
    df.loc[(df['month'].isin([7, 8, 9])), 'quarter'] = 3
    df.loc[(df['month'].isin([10, 11, 12])), 'quarter'] = 4
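The same mapping can also be computed arithmetically instead of with four assignments. A minimal sketch, assuming `month` holds integers from 1 to 12; the toy DataFrame is made up for illustration:

```python
import pandas as pd

# Toy data for illustration only; 'month' is assumed to be an integer in 1..12.
df = pd.DataFrame({'month': [1, 3, 4, 6, 7, 9, 10, 12]})

# Months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
df['quarter'] = (df['month'] - 1) // 3 + 1
```

Unlike the `isin` version, this does not leave a 0 for unexpected month values, so it is only a drop-in replacement when the column is known to be clean.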
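1.4 Similarity between feature variations

① For each ship, count the number of records in each speed level.
② Divide the direction into 16 equal sectors.
③ Count the records with zero speed, and compute statistics over the records with non-zero speed.
④ Add the median and other quantiles of x, v, d, and y, and drop redundant statistics such as count, mean, min, max, and std.
⑤ Grouping by ship, use shift to compute the differences between adjacent records (offsets).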
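Steps ②, ③ and ⑤ might look roughly like the following. This is a sketch under assumptions, not the original implementation: the toy data is invented, `d` is assumed to be a heading in degrees in [0, 360), and the output column names (`d_16`, `v_zero_count`, `v_nonzero_mean`, `v_nonzero_max`, `v_diff`) are hypothetical.

```python
import pandas as pd

# Toy trajectory data for illustration; column names follow the text (ship, v, d).
# 'd' is assumed to be a heading in degrees in [0, 360).
df = pd.DataFrame({
    'ship': [0, 0, 0, 1, 1, 1],
    'v':    [0.0, 2.5, 3.0, 0.0, 0.0, 1.0],
    'd':    [0, 45, 350, 100, 180, 22],
})

# Step 2: split the heading into 16 equal sectors of 22.5 degrees each.
df['d_16'] = (df['d'] // 22.5).astype(int) % 16

# Step 3: count zero-speed records per ship, and summarize the non-zero speeds.
zero_count = (df['v'] == 0).groupby(df['ship']).sum().rename('v_zero_count')
nonzero_stats = (
    df[df['v'] > 0]
    .groupby('ship')['v']
    .agg(v_nonzero_mean='mean', v_nonzero_max='max')
)

# Step 5: difference between adjacent records within each ship.
df['v_diff'] = df.groupby('ship')['v'].diff()

# One row per ship, ready to merge into a training table.
features = zero_count.to_frame().join(nonzero_stats).reset_index()
```

The per-ship table `features` can then be merged onto the training data by `ship`, in the same way as the statistical features in section 3.3.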

2. Constructing binning features

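2.1 Binning features for longitude/latitude and speed

    import pandas as pd

    # Bin speed into 200 quantile bins; duplicates='drop' merges bins with identical edges
    df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop')
    # Map each bin to an integer code
    df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique()))))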
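Mapping through `unique()` numbers the bins in order of first appearance, so the codes do not necessarily follow the speed ordering. If ordered codes are wanted, the categorical codes of the `qcut` result are one option. A small sketch on made-up speeds, using 4 bins instead of 200 to keep the output readable:

```python
import pandas as pd

# Toy speeds for illustration only.
df = pd.DataFrame({'v': [0.0, 0.1, 0.5, 1.2, 2.0, 3.3, 4.1, 5.0]})

# Quantile binning into 4 bins; duplicates='drop' merges bins with identical edges.
df['v_bin'] = pd.qcut(df['v'], 4, duplicates='drop')

# Ordered integer code of each bin (0 is the lowest-speed bin).
df['v_bin_code'] = df['v_bin'].cat.codes
```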
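2.2 Binning longitude and latitude, then constructing regions

    import numpy as np

    # x_bins (bin edges) and col_bins (number of bins along x) are assumed to be defined earlier
    traj.sort_values(by='x', inplace=True)
    x_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, col_bins + 1):
        low, high = x_bins[i-1], x_bins[i]
        # Walk through the sorted points and assign bin i while x stays within (low, high]
        while j < len(traj):
            if (traj["x"].iloc[j] <= high) & (traj["x"].iloc[j] > low - 0.001):
                x_res[j] = i
                j += 1
            else:
                break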
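For values that fall inside the bin range, the loop above assigns each point the index `i` with `x_bins[i-1] < x <= x_bins[i]`. A vectorized alternative, offered as a sketch rather than as the original method, is `np.searchsorted`. The toy `traj`, `col_bins`, and evenly spaced `x_bins` below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy inputs for illustration; x_bins holds col_bins + 1 increasing bin edges.
traj = pd.DataFrame({'x': [0.0, 0.4, 1.0, 1.7, 2.0, 3.0]})
col_bins = 3
x_bins = np.linspace(traj['x'].min(), traj['x'].max(), col_bins + 1)

# For each x, find the index i such that x_bins[i-1] < x <= x_bins[i].
x_res = np.searchsorted(x_bins, traj['x'].to_numpy(), side='left')

# A value equal to the lowest edge gets index 0; place it in the first bin,
# which mirrors the small tolerance the loop applies to the lower bound.
x_res = np.clip(x_res, 1, col_bins)
```

This version does not require sorting `traj` first, so the result stays aligned with the original row order.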

3. Constructing DataFrame features

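3.1 Count features

    import pandas as pd

    def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
        # Number of distinct boats (id) that visited each bin
        unique_boat_count_df = traj_data_df.groupby(["no_bin"])["id"].nunique().reset_index()
        unique_boat_count_df.rename({"id": "visit_boat_count"}, axis=1, inplace=True)
        # Attach the counts to the bin-to-coordinate table
        unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,
                                             on="no_bin", how="left")
        # The return line is truncated in the source; returning the merged table is an assumption
        return unique_boat_count_df_save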
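3.2 Shift offsets

    # g is used but not defined in the source; grouping by ship is an assumption
    g = df.groupby('ship')
    for f in ['x', 'y']:
        # Shift the x, y coordinates in time: previous step, next step, and the span between the two
        df[f + '_prev_diff'] = df[f] - g[f].shift(1)
        df[f + '_next_diff'] = df[f] - g[f].shift(-1)
        df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)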
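A toy run makes the behaviour at group boundaries visible. The sketch below assumes the rows are already sorted by time within each ship; the data is invented for illustration.

```python
import pandas as pd

# Two ships with made-up x coordinates, already in time order within each ship.
df = pd.DataFrame({
    'ship': [0, 0, 0, 1, 1],
    'x':    [10, 12, 15, 100, 90],
})
g = df.groupby('ship')

# Difference from the previous and the next record of the same ship.
df['x_prev_diff'] = df['x'] - g['x'].shift(1)
df['x_next_diff'] = df['x'] - g['x'].shift(-1)
```

Because the shift is applied per group, the first record of each ship has no previous value and gets NaN in `x_prev_diff`, rather than borrowing the last position of another ship.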
3.3 Statistical features

    import pandas as pd

    def group_feature(df, key, target, aggs, flag):
        # Build a dict of {new column name: aggregation} to aggregate and rename in one step
        agg_dict = {}
        for ag in aggs:
            agg_dict['{}_{}_{}'.format(target, ag, flag)] = ag
        # Named aggregation; passing the dict itself to a SeriesGroupBy
        # raises SpecificationError in pandas >= 1.0
        t = df.groupby(key)[target].agg(**agg_dict).reset_index()
        return t

    # flag (a column-name suffix) and train are assumed to be defined earlier
    t = group_feature(df, 'ship', 'x', ['max', 'min', 'mean', 'median', 'std', 'skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship', 'y', ['max', 'min', 'mean', 'median', 'std', 'skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship', 'base_dis_diff', ['max', 'min', 'mean', 'std', 'skew'], flag)
    train = pd.merge(train, t, on='ship', how='left')
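To see what the resulting column names look like, here is a self-contained toy run of a `group_feature`-style helper written with pandas named aggregation. The data and the suffix `'all'` are hypothetical values chosen for illustration.

```python
import pandas as pd

def group_feature(df, key, target, aggs, flag):
    # Map each new column name '<target>_<agg>_<flag>' to its aggregation.
    agg_dict = {'{}_{}_{}'.format(target, ag, flag): ag for ag in aggs}
    # Named aggregation: keyword = new column name, value = aggregation function.
    return df.groupby(key)[target].agg(**agg_dict).reset_index()

# Toy data: two ships with two x readings each.
df = pd.DataFrame({'ship': [0, 0, 1, 1], 'x': [1.0, 3.0, 10.0, 20.0]})

# 'all' is a hypothetical suffix standing in for flag.
t = group_feature(df, 'ship', 'x', ['max', 'min', 'mean'], 'all')
```

Each call yields one row per ship, which is why the results can be merged onto `train` with a left join on `ship`.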

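4. Constructing embedding features

Word embedding maps words into another vector space. After projection, words of the same type lie closer together in that space and tend to cluster.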

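4.1 Building word vectors with Word2vec

    import multiprocessing as mp

    from gensim.models import Word2Vec
    from tqdm import tqdm

    # sentences, num_runs, embedding_size, min_count, window_size, seed and iters
    # are assumed to be defined earlier
    for i in tqdm(range(num_runs)):
        # Parameter names follow gensim >= 4.0; gensim 3.x used size= and iter= instead
        model = Word2Vec(sentences, vector_size=embedding_size,
                         min_count=min_count,
                         workers=mp.cpu_count(),
                         window=window_size,
                         seed=seed, epochs=iters, sg=0)

        # Represent each sequence by the mean of its word vectors
        embedding_vec = []
        for ind, seq in enumerate(sentences):
            seq_vec, word_count = 0, 0
            for word in seq:
                # Skip words that are not in the trained vocabulary
                if word not in model.wv:
                    continue
                else:
                    seq_vec += model.wv[word]
                    word_count += 1
            if word_count == 0:
                embedding_vec.append(embedding_size * [0])
            else:
                embedding_vec.append(seq_vec / word_count)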
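The averaging step can be checked in isolation without training a model. In the sketch below, `word_vectors` is a hypothetical stand-in for trained Word2Vec vectors, and the sequences are made up.

```python
import numpy as np

embedding_size = 3

# Hypothetical stand-in for trained word vectors (not real Word2Vec output).
word_vectors = {
    'a': np.array([1.0, 0.0, 0.0]),
    'b': np.array([0.0, 1.0, 0.0]),
}

# 'zzz' plays the role of an out-of-vocabulary word.
sentences = [['a', 'b'], ['a', 'zzz'], ['zzz']]

def sentence_vector(seq):
    # Keep only the words that have a vector.
    vecs = [word_vectors[w] for w in seq if w in word_vectors]
    if not vecs:
        # No known word: fall back to a zero vector.
        return np.zeros(embedding_size)
    # Mean of the word vectors.
    return np.mean(vecs, axis=0)

embedding_vec = np.vstack([sentence_vector(seq) for seq in sentences])
```

Stacking the results gives one fixed-length row per sequence, which can be used directly as feature columns.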
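4.2 Extracting topic distributions from text with NMF

TF-IDF is a weighting technique that scores the importance of a term from how often it appears: the weight rises with the term's frequency within a document and falls with the number of documents that contain it.

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Fragment from a class method: self.data, self.nmf_n and tf_n are defined elsewhere
    # Vectorize the text with TF-IDF
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n, tf_n))
    tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)

    # Use NMF to extract the topic distribution of each text
    text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)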
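A standalone version of the same pipeline on a toy corpus shows the shape of the result. The texts, `n_topics`, and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only.
texts = [
    "fishing boat leaves the harbor",
    "fishing boat returns to the harbor",
    "cargo ship crosses the ocean",
    "cargo ship reaches the ocean port",
]
n_topics = 2

# Unigram TF-IDF matrix: one row per text, one column per term.
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(texts)

# Factorize into non-negative topic weights: one row per text, one column per topic.
text_nmf = NMF(n_components=n_topics, random_state=0, max_iter=500).fit_transform(tfidf)
```

Each row of `text_nmf` holds one non-negative weight per topic and can be appended to the feature table as `n_topics` new columns.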