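Someone in our group chat asked this question:
Given how string interning works, I expected join() to return a string at a different memory address. But when I join a single character, the address doesn't change. Could someone explain why?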
s1 = 'a'
s1_copy = ''.join(s1)
print(id(s1))
print(id(s1_copy))
Output:
2395173012208
2395173012208
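At the time I replied in the group:
Go read the source. The join method here is str.join, which in Python 3 is implemented by _PyUnicode_JoinArray. That function checks whether the sequence to join contains exactly one item, and if that item is an exact str it simply returns the item itself.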
/* If singleton sequence with an exact Unicode, return that. */
last_obj = NULL;
if (seqlen == 1) {
    if (PyUnicode_CheckExact(items[0])) {
        res = items[0];
        Py_INCREF(res);
        return res;
    }
    seplen = 0;
    maxchar = 0;
}
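That reply may not have been very clear, so here is a post that walks through it in detail.
First, get the Python source: https://github.com/python/cpython
Then find the documentation for join: https://docs.python.org/3/library/stdtypes.html?highlight=join#str.join
So join is a method of the built-in str type, which means we need the source for str. In Python 3, all strings are sequences of Unicode characters, so go straight to unicodeobject.c: https://github.com/python/cpython/blob/master/Objects/unicodeobject.c
Sure enough, it contains this comment: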
/*[clinic input]
str.join as unicode_join
iterable: object
/
Concatenate any number of strings.
The string whose method is called is inserted in between each given string.
The result is returned as a new string.
Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
[clinic start generated code]*/
Next, look at unicode_join:
static PyObject *
unicode_join(PyObject *self, PyObject *iterable)
/*[clinic end generated code: output=6857e7cecfe7bf98 input=2f70422bfb8fa189]*/
{
    return PyUnicode_Join(self, iterable);
}
It just forwards to PyUnicode_Join:
PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
{
    PyObject *res;
    PyObject *fseq;
    Py_ssize_t seqlen;
    PyObject **items;
    fseq = PySequence_Fast(seq, "can only join an iterable");
    if (fseq == NULL) {
        return NULL;
    }
    /* NOTE: the following code can't call back into Python code,
     * so we are sure that fseq won't be mutated.
     */
    items = PySequence_Fast_ITEMS(fseq);
    seqlen = PySequence_Fast_GET_SIZE(fseq);
    res = _PyUnicode_JoinArray(separator, items, seqlen);
    Py_DECREF(fseq);
    return res;
}
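One step that is easy to miss: in the original question the argument to join is the string 'a' itself, not a list. PySequence_Fast returns a list or tuple unchanged and converts any other iterable into a list, so a one-character string becomes a one-item sequence. A minimal Python sketch of that step, using list() as a stand-in for PySequence_Fast (the variable names simply mirror the C code):

```python
s1 = 'a'

# list() stands in for PySequence_Fast here: a str is neither a list nor a
# tuple, so it is iterated into a new list of its characters.
items = list(s1)
seqlen = len(items)

print(items)    # ['a']
print(seqlen)   # 1

# Whether the single item is the very same object as s1 is a CPython
# implementation detail, so it is printed here rather than assumed.
print(items[0] is s1)
```

So for ''.join('a'), _PyUnicode_JoinArray is called with seqlen equal to 1.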
And finally _PyUnicode_JoinArray:
PyObject *
_PyUnicode_JoinArray(PyObject *separator, PyObject *const *items, Py_ssize_t seqlen)
{
    PyObject *res = NULL; /* the result */
    PyObject *sep = NULL;
    Py_ssize_t seplen;
    PyObject *item;
    Py_ssize_t sz, i, res_offset;
    Py_UCS4 maxchar;
    Py_UCS4 item_maxchar;
    int use_memcpy;
    unsigned char *res_data = NULL, *sep_data = NULL;
    PyObject *last_obj;
    unsigned int kind = 0;
    /* If empty sequence, return u"". */
    if (seqlen == 0) {
        _Py_RETURN_UNICODE_EMPTY();
    }
    /* If singleton sequence with an exact Unicode, return that.
       This is the key part: when there is exactly one item, it is returned as-is.
    */
    last_obj = NULL;
    if (seqlen == 1) {
        if (PyUnicode_CheckExact(items[0])) {
            res = items[0];
            Py_INCREF(res);
            return res;
        }
        seplen = 0;
        maxchar = 0;
    }
    else {
        ......
    }
    .....
}
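To make the branches easier to follow, here is a rough Python model of the control flow shown above. It is only a sketch: join_array_model is a hypothetical name, and the general path that the excerpt elides is replaced by a naive concatenation loop, which is an assumption and not how CPython actually builds the result.

```python
def join_array_model(separator, items):
    """Rough Python model of the branches in _PyUnicode_JoinArray.

    Hypothetical helper for illustration only, not CPython's code.
    """
    # Empty sequence: return the empty string.
    if len(items) == 0:
        return ""

    # Exactly one item that is an exact str: return that same object.
    # type(x) is str plays the role of PyUnicode_CheckExact.
    if len(items) == 1 and type(items[0]) is str:
        return items[0]

    # General path (elided in the excerpt). ASSUMPTION: modelled here as a
    # plain concatenation loop that always builds a new string.
    result = ""
    for index, item in enumerate(items):
        if not isinstance(item, str):
            raise TypeError(
                f"sequence item {index}: expected str instance, "
                f"{type(item).__name__} found"
            )
        if index > 0:
            result += separator
        result += item
    return result


word = "hello"
print(join_array_model("-", [word]) is word)        # True
print(join_array_model(".", ["ab", "pq", "rs"]))    # ab.pq.rs
```

The model only mirrors the early returns. The real function also tracks maxchar and kind so that it can allocate the result buffer once.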
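So for ''.join("a"), the sequence being joined has exactly one item, the str "a", and join hands back that very object. No wonder the two addresses are identical.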
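The behaviour is easy to check from Python. The sketch below assumes CPython, since returning the same object is an implementation detail and not something the language guarantees. MyStr is a hypothetical subclass added only to exercise the PyUnicode_CheckExact test. A longer string is used to show that the result does not depend on the string being a single character.

```python
class MyStr(str):
    """Hypothetical str subclass, defined only for this demonstration."""


plain = "hello world"
wrapped = MyStr("hello world")

joined_plain = "".join([plain])
joined_wrapped = "".join([wrapped])

# One exact str in the sequence: join returns that same object (CPython).
print(joined_plain is plain)          # True on CPython

# The separator does not matter when there is only one item.
print(", ".join([plain]) is plain)    # True on CPython

# A subclass instance fails the exact-type check, so a new plain str is built.
print(joined_wrapped is wrapped)      # False
print(type(joined_wrapped) is str)    # True
```

In other words, the shared address comes from this shortcut in join, not from interning.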