Notes on Python functions.

Counter()

most_common()
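
A minimal sketch of how Counter and most_common() fit together, assuming the goal is to count items in a list and read off the most frequent ones. The list `words` is made-up illustrative data.

```python
from collections import Counter

words = ["a", "b", "a", "c", "a", "b"]  # illustrative data, not from the notes

counts = Counter(words)           # maps each item to its number of occurrences
top_two = counts.most_common(2)   # list of (item, count) pairs, most frequent first
print(top_two)                    # [('a', 3), ('b', 2)]
```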

NumPy

sum

Computes the sum of the elements of ndarray a along the given axis; axis is an integer or a tuple of integers.
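
A small sketch of how the axis argument changes the result, using a made-up 2x3 array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = a.sum()              # no axis: sum of every element -> 21
col_sums = a.sum(axis=0)     # collapse the rows -> [5 7 9]
row_sums = a.sum(axis=1)     # collapse the columns -> [ 6 15]
both = a.sum(axis=(0, 1))    # tuple of axes: same as no axis here -> 21
```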

argsort

Returns the indices that would sort the array (ascending); for a matrix, an axis can be specified.

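import numpy as np

k = 2
x = np.array([[4, 5, 1],
              [1, 2, 3]])
# argsort is ascending, so [:k] gives the indices of the k smallest values in row 0
top_k_idx = np.argsort(x[0])[:k]
print(x)
print(top_k_idx)  # [2 0]

# sort indices computed independently for each row
top_k = np.argsort(x, axis=1)
print(top_k)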

np.square

np.square is implemented in C and is much faster than ** 😭. 661 s vs. 36 s.

np.linalg.norm

Computes a norm. The L1 norm gives the Manhattan (L1) distance; the L2 norm gives the Euclidean (L2) distance.
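
A minimal sketch of both distances between two points, assuming the points are stored as 1-D arrays. The names `p` and `q` are illustrative.

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

l1 = np.linalg.norm(p - q, ord=1)  # Manhattan distance: 3 + 4 = 7.0
l2 = np.linalg.norm(p - q)         # default for vectors is the L2 norm: sqrt(9 + 16) = 5.0
```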

vstack, hstack

refer

array_split

concatenate
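
A short sketch of the stacking and splitting functions listed above, using made-up 1-D arrays `a` and `b`:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

v = np.vstack((a, b))        # stack as rows -> shape (2, 3)
h = np.hstack((a, b))        # join end to end -> shape (6,)
c = np.concatenate((a, b))   # for 1-D inputs this matches hstack

# array_split allows uneven pieces: 7 elements into 3 parts -> sizes 3, 2, 2
parts = np.array_split(np.arange(7), 3)
```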

np.random.choice

Generates a random sample; the values can be drawn from a range or from a given array.
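
A sketch of both ways of drawing, from an integer range and from an existing array. The seed is there only to make the run repeatable, and the array values are made up.

```python
import numpy as np

np.random.seed(0)  # only so that repeated runs give the same draws

# An int n as the first argument means "draw from np.arange(n)"
from_range = np.random.choice(5, size=3)

# Draw 2 distinct elements from a given array (replace=False forbids repeats)
from_array = np.random.choice(np.array([10, 20, 30]), size=2, replace=False)
```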

reshape

Reshapes an array to the specified shape.
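
A minimal sketch, assuming a flat array of six elements:

```python
import numpy as np

a = np.arange(6)        # [0 1 2 3 4 5]

m = a.reshape(2, 3)     # 2 rows, 3 columns
n = a.reshape(-1, 2)    # -1 lets NumPy infer that dimension -> shape (3, 2)
```

The total number of elements must stay the same, otherwise reshape raises a ValueError.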

mean

Computes the mean; an axis can be specified for a matrix.
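
A small sketch of the mean with and without an axis, on a made-up 2x2 matrix:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

overall = a.mean()          # mean of all elements -> 2.5
col_means = a.mean(axis=0)  # one mean per column -> [2. 3.]
row_means = a.mean(axis=1)  # one mean per row -> [1.5 3.5]
```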

fmax(x1, x2)

Compares two arrays and returns a new array containing the element-wise maxima. x2 can also be a scalar.
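
A minimal sketch of both forms, array against array and array against scalar, with made-up values:

```python
import numpy as np

x1 = np.array([1, 5, 3])
x2 = np.array([4, 2, 6])

pairwise = np.fmax(x1, x2)    # element-wise maxima -> [4 5 6]
floored = np.fmax(x1, 2)      # scalar is broadcast -> [2 5 3]

# fmax ignores NaN when the other operand is a number
ignores_nan = np.fmax(np.nan, 1.0)  # -> 1.0
```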

numpy.random.randn(d0, d1, …, dn)

Returns one or more samples from the standard normal distribution.

numpy.random.rand(d0, d1, …, dn)

Returns random samples drawn uniformly from [0, 1).
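
A sketch comparing the two generators. The arguments are the dimensions of the output, and the seed is only there to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # only so that repeated runs give the same numbers

normal = np.random.randn(2, 3)   # 2x3 samples from the standard normal distribution
uniform = np.random.rand(2, 3)   # 2x3 samples, uniform on [0, 1)
single = np.random.randn()       # no arguments: a single Python float
```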

\xa0 \n

>>> s
'T-shirt\xa0\xa0短袖圆领衫,体恤衫\xa0'
>>> out = "".join(s.split())
>>> out
'T-shirt短袖圆领衫,体恤衫'

list comprehension

new_list = [ expression(i) for i in old_list if filter(i)]
>>> data[0]
{'votes': {'funny': 0, 'useful': 5, 'cool': 2},
 'user_id': 'rLtl8ZkDX5vH5nAx9C3q5Q',
 'review_id': 'fWKvX83p0-ka4JS3dc6E5A'}
>>> votes = pd.DataFrame([i['votes'] for i in data]) # data is a list whose elements are dicts.

np.random.shuffle()

In [26]: x = np.arange(10)

In [27]: x
Out[27]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [28]: np.random.shuffle(x)

In [29]: x
Out[29]: array([8, 3, 4, 7, 9, 5, 1, 6, 0, 2])

pandas : get DataFrame rows by index array

def split_data(data):
    x_num = data.shape[0]
    x_idx = np.arange(x_num)
    np.random.shuffle(x_idx)
    x_idx_train = x_idx[0 : int(x_num*0.7)]
    x_idx_test = x_idx[int(x_num*0.7) : ]
    train = data.iloc[x_idx_train]
    test = data.iloc[x_idx_test]
    return train, test

str clean

str1 = ''.join(str1.split()) # removes all whitespace, including '\n' and '\xa0'

pandas : add a column to a DataFrame

data_df['cool'] = votes['cool']

pandas : groupby mean

data_df.groupby('stars').mean()

pandas : group and count unique values

srcIp_Host_count = df.groupby(by='srcIp')['requestHost'].nunique()
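
A small worked sketch of the same call on a made-up frame. The column names follow the line above, and the rows are purely illustrative.

```python
import pandas as pd

# Illustrative data: 10.0.0.1 requests two distinct hosts, 10.0.0.2 only one
df = pd.DataFrame({
    "srcIp": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "requestHost": ["a.com", "b.com", "a.com", "a.com"],
})

# One row per srcIp, holding the number of distinct requestHost values
srcIp_Host_count = df.groupby(by="srcIp")["requestHost"].nunique()
print(srcIp_Host_count)
```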

pandas : SettingWithCopyWarning

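pandas : reindex(), reset_index()

reindex() returns the rows whose index labels are listed in the argument (labels that do not exist become NaN rows).
reset_index() is the one that actually resets the index.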
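
A minimal sketch of the difference, on a made-up frame whose index is deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30]}, index=[2, 0, 1])

# reindex: look rows up by label, in the order requested
picked = df.reindex([0, 1])          # rows labeled 0 and 1 -> v = 20, 30

# reset_index: discard the old labels and renumber 0..n-1, row order unchanged
renum = df.reset_index(drop=True)    # v = 10, 20, 30 with index 0, 1, 2
```

Without drop=True, reset_index() keeps the old labels as a new column named index.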

pandas : apply can also be used to iterate over the rows of a df

import re
def clear_character(item):
    '''Remove all non-Chinese characters.'''
    pattern1 = r'[a-zA-Z0-9]'
    pattern2 = r'\[.*?\]'
    pattern3 = re.compile(r'[^\s1234567890:：' + '\u4e00-\u9fa5]+')
    pattern4 = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+'
    if len(item['content']) == 0:
        item['content'] = item['title']
    line1 = re.sub(pattern1, '', item['content'])  # remove English letters and digits
    line2 = re.sub(pattern2, '', line1)  # remove emoticons (bracketed tags such as [...])
    line3 = re.sub(pattern3, '', line2)  # remove other characters
    line4 = re.sub(pattern4, '', line3)  # remove leftover colons and other punctuation
    item['content'] = ''.join(line4.split())  # remove whitespace
    return item

data = data.apply(clear_character, axis=1)

jupyter : allow access from external networks

jupyter notebook --ip=<host_ip>