Learn Python

Notes on Python functions.

Counter()

most_common()
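
A minimal sketch of how `Counter` and `most_common()` fit together. The word list and the names `words`, `counts`, and `top_two` are made up for illustration and are not from the original notes.

```python
from collections import Counter

# Hypothetical sample data, for illustration only.
words = ["a", "b", "a", "c", "a", "b"]
counts = Counter(words)

# most_common(n) returns the n most frequent items as (item, count) pairs,
# ordered from most to least frequent.
top_two = counts.most_common(2)
print(top_two)  # [('a', 3), ('b', 2)]
```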

NumPy

sum

Computes the sum of the elements of ndarray a along the given axis; axis can be an integer or a tuple of integers.
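
As a rough illustration of the axis argument, the sketch below sums a small made-up 2x3 array in several ways. The array `a` and the variable names are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical 2x3 array, for illustration only.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(a)                  # all elements: 21
col_sums = np.sum(a, axis=0)       # collapse the rows: [5 7 9]
row_sums = np.sum(a, axis=1)       # collapse the columns: [6 15]
all_axes = np.sum(a, axis=(0, 1))  # a tuple of axes: 21
```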

argsort

Returns the indices that would sort the array (in ascending order); for a matrix, an axis can be specified.

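import numpy as np

k = 2
x = np.array([[4, 5, 1],
              [1, 2, 3]])
# Indices of the k smallest values in the first row (argsort is ascending).
top_k_idx = np.argsort(x[0])[:k]
print(x)
print(top_k_idx)  # [2 0]

# Sorting indices computed along each row of the matrix.
top_k = np.argsort(x, axis=1)
print(top_k)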

np.square

np.square is implemented in C and was much faster than ** 😭. Timing: 661 s vs. 36 s.

np.linalg.norm

Norms: the L1 norm gives the Manhattan distance (L1 distance), and the L2 norm gives the Euclidean distance (L2 distance). See np.linalg.norm.
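
A small sketch of the two distances, assuming two made-up points `p` and `q`. The distance between them is the norm of their difference.

```python
import numpy as np

# Hypothetical points, for illustration only.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

l1 = np.linalg.norm(p - q, ord=1)  # Manhattan distance: |-3| + |-4| = 7
l2 = np.linalg.norm(p - q)         # Euclidean distance (default ord): sqrt(9 + 16) = 5
```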

vstack, hstack

array_split

concatenate
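
The notes list vstack, hstack, array_split, and concatenate without examples, so here is a hedged sketch of how they relate for 2-D arrays. The arrays `a` and `b` are made up for illustration.

```python
import numpy as np

# Hypothetical 2x2 arrays, for illustration only.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

v = np.vstack((a, b))                # stack rows: shape (4, 2)
h = np.hstack((a, b))                # stack columns: shape (2, 4)
c0 = np.concatenate((a, b), axis=0)  # same result as vstack for 2-D input
c1 = np.concatenate((a, b), axis=1)  # same result as hstack for 2-D input

# array_split tolerates uneven splits: 7 elements in 3 chunks gives sizes 3, 2, 2.
parts = np.array_split(np.arange(7), 3)
```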

np.random.choice

Generates a random sample; the values can be drawn from a range or from a given array.
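
A minimal sketch of both uses. The seed, the `colors` array, and the sizes are assumptions picked for the example; the seed is there only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # only so the sketch is repeatable

# An integer n means "draw from np.arange(n)".
from_range = np.random.choice(5, size=3)

# Draw from an existing array; replace=False forbids repeated picks.
colors = np.array(["red", "green", "blue"])
from_array = np.random.choice(colors, size=2, replace=False)
```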

reshape

Reshapes an array to the specified shape.
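
For illustration, a short sketch with a made-up six-element array. Passing -1 for one dimension asks NumPy to infer it from the total size.

```python
import numpy as np

m = np.arange(6)         # [0 1 2 3 4 5]
grid = m.reshape(2, 3)   # [[0 1 2], [3 4 5]]
auto = m.reshape(-1, 2)  # -1 is inferred as 3, so the shape is (3, 2)
```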

mean

Computes the mean; an axis of the matrix can be specified.
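
A small sketch of the axis argument, using a made-up 2x2 array named `scores`.

```python
import numpy as np

# Hypothetical 2x2 array, for illustration only.
scores = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

overall = scores.mean()           # mean of all elements: 2.5
per_column = scores.mean(axis=0)  # [2. 3.]
per_row = scores.mean(axis=1)     # [1.5 3.5]
```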

fmax(x1, x2)

Compares two arrays and returns a new array containing the element-wise maxima. x2 can also be a scalar.
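
A hedged sketch with made-up inputs. It also shows one property of fmax worth knowing: when exactly one of the two elements is NaN, fmax returns the other one.

```python
import numpy as np

# Hypothetical inputs, for illustration only.
x1 = np.array([1.0, 5.0, np.nan])
x2 = np.array([3.0, 2.0, 4.0])

pairwise = np.fmax(x1, x2)  # [3. 5. 4.]; the NaN in x1 is ignored
floored = np.fmax(x2, 2.5)  # scalar second argument: [3.  2.5 4. ]
```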

numpy.random.randn(d0, d1, …, dn)

Returns one or more samples from the standard normal distribution.

numpy.random.rand(d0, d1, …, dn)

Returns random samples drawn uniformly from [0, 1).
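
To contrast randn and rand, the sketch below draws a 2x3 array from each. The shapes are arbitrary choices for the example, and the values differ on every run.

```python
import numpy as np

normal = np.random.randn(2, 3)  # standard normal samples, shape (2, 3)
uniform = np.random.rand(2, 3)  # uniform samples in [0, 1), shape (2, 3)

print(normal.shape, uniform.shape)
```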

Removing \xa0 and \n

>>> s
'T-shirt\xa0\xa0短袖圆领衫,体恤衫\xa0'
>>> out = "".join(s.split())
>>> out
'T-shirt短袖圆领衫,体恤衫'

list comprehension

new_list = [expression(i) for i in old_list if filter(i)]

>>> data[0]
{'votes': {'funny': 0, 'useful': 5, 'cool': 2},
 'user_id': 'rLtl8ZkDX5vH5nAx9C3q5Q',
 'review_id': 'fWKvX83p0-ka4JS3dc6E5A'}
>>> votes = pd.DataFrame([i['votes'] for i in data])  # data is a list whose elements are dicts

np.random.shuffle()

In [26]: x = np.arange(10)

In [27]: x
Out[27]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [28]: np.random.shuffle(x)

In [29]: x
Out[29]: array([8, 3, 4, 7, 9, 5, 1, 6, 0, 2])

pandas : select DataFrame rows by an array of positions

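import numpy as np

def split_data(data):
    """Shuffle the rows of a DataFrame and split them 70/30 into train and test."""
    x_num = data.shape[0]
    x_idx = np.arange(x_num)
    np.random.shuffle(x_idx)        # shuffle the row positions in place
    split = int(x_num * 0.7)
    x_idx_train = x_idx[:split]
    x_idx_test = x_idx[split:]
    train = data.iloc[x_idx_train]  # iloc selects rows by integer position
    test = data.iloc[x_idx_test]
    return train, test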

str clean

str1 = ''.join(str1.split())  # removes all whitespace, including '\n' and '\xa0'

pandas : add a column to a DataFrame

data_df['cool'] = votes['cool']

pandas : groupby mean

data_df.groupby('stars').mean(numeric_only=True)  # numeric_only skips non-numeric columns such as IDs

pandas : group and count unique values

srcIp_Host_count = df.groupby(by='srcIp')['requestHost'].nunique()
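
To make the result concrete, here is a sketch of the same call on a tiny made-up DataFrame. The IP addresses and host names are invented for illustration.

```python
import pandas as pd

# Hypothetical request log, for illustration only.
df = pd.DataFrame({
    "srcIp": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "requestHost": ["a.com", "b.com", "a.com", "a.com"],
})

# For each source IP, count how many distinct hosts it requested.
srcIp_Host_count = df.groupby(by="srcIp")["requestHost"].nunique()
print(srcIp_Host_count)
```

The result is a Series indexed by srcIp; here 10.0.0.1 maps to 2 and 10.0.0.2 maps to 1.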

pandas : SettingWithCopyWarning

pandas : reindex() , reset_index()

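reindex() selects the rows whose index labels are given in the argument (labels that do not exist produce NaN rows).
reset_index() is the one that actually resets the index.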
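
A minimal sketch of the difference, assuming a made-up three-row DataFrame with string labels.

```python
import pandas as pd

# Hypothetical frame with a string index, for illustration only.
df = pd.DataFrame({"v": [10, 20, 30]}, index=["a", "b", "c"])

# reindex picks rows by label; the unknown label 'z' becomes a NaN row.
picked = df.reindex(["c", "a", "z"])

# reset_index replaces the index with 0..n-1; drop=True discards the old labels.
reset = df.reset_index(drop=True)
```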

pandas : apply can also be used to iterate over the rows of a DataFrame

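import re

def clear_character(item):
    '''Remove all non-Chinese characters from item['content'].'''
    pattern1 = r'[a-zA-Z0-9]'
    pattern2 = r'\[.*?\]'
    pattern3 = re.compile(r'[^\s1234567890::' + '\u4e00-\u9fa5]+')
    pattern4 = '[’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+'
    # Fall back to the title when the content is empty.
    if len(item['content']) == 0:
        item['content'] = item['title']
    line1 = re.sub(pattern1, '', item['content'])  # remove English letters and digits
    line2 = re.sub(pattern2, '', line1)            # remove bracketed emoticon tags
    line3 = re.sub(pattern3, '', line2)            # remove other characters
    line4 = re.sub(pattern4, '', line3)            # remove leftover colons and other punctuation
    item['content'] = ''.join(line4.split())       # remove whitespace
    return item

# Apply the function to every row (axis=1) of the DataFrame.
data = data.apply(clear_character, axis=1)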

Jupyter: allow access from external networks

jupyter notebook --ip=<host_ip>