-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-121459: Deferred LOAD_GLOBAL #123128
gh-121459: Deferred LOAD_GLOBAL #123128
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can only nitpick on things I understand :') But I will, at some point, try to understand the magic of bytecodes.c
.
@@ -47,10 +47,10 @@ def write_header( | |||
) | |||
|
|||
|
|||
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None: | |||
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, allow_unbalanced_parens: bool = False) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, allow_unbalanced_parens: bool = False) -> None: | |
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None: |
Objects/dictobject.c
Outdated
goto read_failed; | ||
} | ||
|
||
if (ix >= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put the case if (ix < 0)
first to reflect the if (rc < 0)
constructions that we usually do?
Objects/dictobject.c
Outdated
ix = _Py_dict_lookup(mp, key, hash, &value); | ||
*value_addr = value == NULL ? PyStackRef_NULL : PyStackRef_FromPyObjectNew(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you use the _Py_dict_lookup_threadsafe_stackref
helper here since you acquired the lock? This would save the PyObject *value
declaration (and you could consider inlining the helper actually)
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
…ython into deferred_globals_final
Objects/dictobject.c
Outdated
goto read_failed; | ||
} | ||
if (ix >= 0) { | ||
*value_addr = PyStackRef_FromPyObjectNew(DK_ENTRIES(dk)[ix].me_value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure I understand the TSAN failures on dictionaries here. @colesbury do you have a clue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not thread-safe it the value is not deferred or immortal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait but doesn't PyStackRef_FromPyObjectNew
incref the value if it's not deferred or immortal, which make it thread safe? Or maybe I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling Py_INCREF()
on a value that might be concurrently freed is not thread-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh okay that makes sense now. So one possible solution would be to introduce another variant that tries to incref and is allowed to fail? Like in https://github.com/python/cpython/blob/main/Objects/dictobject.c#L1435
Or we can check if it's deferred/immortal, and only if it's not we do the TryXGetRef thing, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, do the check that it's deferred/immortal and if not do the TryXGetRef
thing.
I think it's best to write it out explicitly first. If there are other uses of it, we can refactor it into a function later.
Objects/dictobject.c
Outdated
dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
kind = dk->dk_kind; | ||
|
||
if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is complicated and duplicates logic elsewhere. I think it might be worth only supporting a a single case or two. For example, unicode-only and non-split dict. Fallback to _Py_dict_lookup_threadsafe
+ PyStackRef_FromPyObjectSteal
for other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is worth changing anything. LOAD_GLOBAL
is specialized incredibly well, so this only applies to a rare slow path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not specializing at all in the free-threaded build so LOAD_GLOBAL
is a scaling bottleneck now that fewer types are immortalized.
I don't think we want to wait for specialization to be made thread-safe, but we can treat this as a temporary measure and structure the code to reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this case to fall back to _Py_dict_lookup_threadsafe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks backwards to me. The common case should be DICT_KEYS_UNICODE
for globals dictionaries, but this only handles DICT_KEYS_GENERAL
efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still too complicated. We only need to handle the common case of unicode keys. Globals are almost never split dictionaries.
I think this should look something like the following. You can refactor it further by extracting a _Py_TryXGetStackRef()
that's like _Py_TryXGetRef
but returns a _PyStackRef
.
PyDictKeysObject *dk = _Py_atomic_load_ptr(&mp->ma_keys);
if (dk->dk_kind == DICT_KEYS_UNICODE && PyUnicode_CheckExact(key)) {
Py_ssize_t ix = unicodekeys_lookup_unicode_threadsafe(dk, key, hash);
if (ix == DKIX_EMPTY) {
*value_addr = PyStackRef_NULL;
return ix;
}
else if (ix >= 0) {
PyObject **addr_of_value = &DK_UNICODE_ENTRIES(dk)[ix].me_value;
PyObject *value = _Py_atomic_load_ptr(addr_of_value);
if (value == NULL) {
*value_addr = PyStackRef_NULL;
return DKIX_EMPTY;
}
if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
*value_addr = (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
return ix;
}
if (_Py_TryIncrefCompare(addr_of_value, value)) {
*value_addr = PyStackRef_FromPyObjectSteal(value);
return ix;
}
}
}
PyObject *obj;
Py_ssize_t ix = _Py_dict_lookup_threadsafe(mp, key, hash, &obj);
if (ix >= 0 && obj != NULL) {
*value_addr = PyStackRef_FromPyObjectSteal(obj);
}
else {
*value_addr = PyStackRef_NULL;
}
return ix;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see the point of these changes. LOAD_GLOBAL
is something like 1 in 5000 of executed instructions. Its performance is irrelevant.
What is STACK_ENTRY
for? All outputs are already declared in the stack comment for the op.
Include/internal/pycore_dict.h
Outdated
|
||
extern Py_ssize_t _PyDict_LookupIndex(PyDictObject *, PyObject *); | ||
extern Py_ssize_t _PyDictKeys_StringLookup(PyDictKeysObject* dictkeys, PyObject *key); | ||
PyAPI_FUNC(PyObject *)_PyDict_LoadGlobal(PyDictObject *, PyDictObject *, PyObject *); | ||
PyAPI_FUNC(void)_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *, _PyStackRef *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyAPI_FUNC(void)_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *, _PyStackRef *); | |
PyAPI_FUNC(_PyStackRef )_PyDict_LoadGlobalStackRef(PyDictObject *, PyDictObject *, PyObject *); |
Likewise
Python/bytecodes.c
Outdated
(PyDictObject *)BUILTINS(), | ||
name); | ||
if (v_o == NULL) { | ||
_PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inconsistent with PyMapping_GetOptionalItem
above.
I think it would be better to change the whole instruction to use stack refs, or leave it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed stackrefs from the instruction to leave it as is.
Python/bytecodes.c
Outdated
_PyDict_LoadGlobalStackRef((PyDictObject *)GLOBALS(), | ||
(PyDictObject *)BUILTINS(), | ||
name, | ||
STACK_ENTRY(v)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does STACK_ENTRY
do? Does it have any side effects, or is it just equivalent to &v
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Practically equivalent to &stack_pointer[whatever_v_index_is]
. It has no side effects.
@@ -47,10 +47,10 @@ def write_header( | |||
) | |||
|
|||
|
|||
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None: | |||
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want balanced parentheses? What is the construct that doesn't balance parentheses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically it's when we replace something in the middle of the line but still want the rest of the line (and all it's parantheses)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give me an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g. we replace the before foo(STACK_ENTRY(x));
, with foo(stack_pointer[-1]);
, we want the replacement function to write until the semicolon ;
, ignoring all balanced brackets from the replacement point till that semicolon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding allow_unbalanced_parens
can we push the logic into the caller (stack_entry
)? Like replace up to the matching )
and then consume/emit tokens up to the end of the statement?
Objects/dictobject.c
Outdated
dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
kind = dk->dk_kind; | ||
|
||
if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is worth changing anything. LOAD_GLOBAL
is specialized incredibly well, so this only applies to a rare slow path.
@colesbury I think the PR is in better shape now. Can you take a look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets ensure that this actually improves the scalability of LOAD_GLOBAL
calls.
Python/bytecodes.c
Outdated
@@ -1507,10 +1507,9 @@ dummy_func( | |||
|
|||
op(_LOAD_GLOBAL, ( -- res, null if (oparg & 1))) { | |||
PyObject *name = GETITEM(FRAME_CO_NAMES, oparg>>1); | |||
PyObject *res_o = _PyEval_LoadGlobal(GLOBALS(), BUILTINS(), name); | |||
ERROR_IF(res_o == NULL, error); | |||
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, STACK_ENTRY(res)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two suggestions:
- Make
_PyEval_LoadGlobalStackRef
and_PyDict_LoadGlobalStackRef
returnint
error codes - Use
&STACK_ENTRY(res)
because it makes it more clear that the value is a pointer and because it means theSTACK_ENTRY()
transformation preserves the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will apply 2. but not 1. The reasoning is to keep consistency with _PyEval_LoadGlobal (non-stackref version).
@@ -47,10 +47,10 @@ def write_header( | |||
) | |||
|
|||
|
|||
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str) -> None: | |||
def emit_to(out: CWriter, tkn_iter: Iterator[Token], end: str, *, allow_unbalanced_parens: bool = False) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding allow_unbalanced_parens
can we push the logic into the caller (stack_entry
)? Like replace up to the matching )
and then consume/emit tokens up to the end of the statement?
Objects/dictobject.c
Outdated
dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
kind = dk->dk_kind; | ||
|
||
if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks backwards to me. The common case should be DICT_KEYS_UNICODE
for globals dictionaries, but this only handles DICT_KEYS_GENERAL
efficiently.
Main:
This branch:
Create_pyobject became a lot faster! |
Benchmarking results courtesy of Mike: no slowdown on JIT build https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20240828-3.14.0a0-bfd4400-JIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in reviewing this. I think we should simplify this further to only focus on the case we care about.
Objects/dictobject.c
Outdated
dk = _Py_atomic_load_ptr(&mp->ma_keys); | ||
kind = dk->dk_kind; | ||
|
||
if (kind != DICT_KEYS_GENERAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is still too complicated. We only need to handle the common case of unicode keys. Globals are almost never split dictionaries.
I think this should look something like the following. You can refactor it further by extracting a _Py_TryXGetStackRef()
that's like _Py_TryXGetRef
but returns a _PyStackRef
.
PyDictKeysObject *dk = _Py_atomic_load_ptr(&mp->ma_keys);
if (dk->dk_kind == DICT_KEYS_UNICODE && PyUnicode_CheckExact(key)) {
Py_ssize_t ix = unicodekeys_lookup_unicode_threadsafe(dk, key, hash);
if (ix == DKIX_EMPTY) {
*value_addr = PyStackRef_NULL;
return ix;
}
else if (ix >= 0) {
PyObject **addr_of_value = &DK_UNICODE_ENTRIES(dk)[ix].me_value;
PyObject *value = _Py_atomic_load_ptr(addr_of_value);
if (value == NULL) {
*value_addr = PyStackRef_NULL;
return DKIX_EMPTY;
}
if (_Py_IsImmortal(value) || _PyObject_HasDeferredRefcount(value)) {
*value_addr = (_PyStackRef){ .bits = (uintptr_t)value | Py_TAG_DEFERRED };
return ix;
}
if (_Py_TryIncrefCompare(addr_of_value, value)) {
*value_addr = PyStackRef_FromPyObjectSteal(value);
return ix;
}
}
}
PyObject *obj;
Py_ssize_t ix = _Py_dict_lookup_threadsafe(mp, key, hash, &obj);
if (ix >= 0 && obj != NULL) {
*value_addr = PyStackRef_FromPyObjectSteal(obj);
}
else {
*value_addr = PyStackRef_NULL;
}
return ix;
Co-Authored-By: Sam Gross <655866+colesbury@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we go back to defining res
as an array of one.
Python/bytecodes.c
Outdated
@@ -1507,10 +1507,9 @@ dummy_func( | |||
|
|||
op(_LOAD_GLOBAL, ( -- res, null if (oparg & 1))) { | |||
PyObject *name = GETITEM(FRAME_CO_NAMES, oparg>>1); | |||
PyObject *res_o = _PyEval_LoadGlobal(GLOBALS(), BUILTINS(), name); | |||
ERROR_IF(res_o == NULL, error); | |||
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &STACK_ENTRY(res)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &STACK_ENTRY(res)); | |
_PyEval_LoadGlobalStackRef(GLOBALS(), BUILTINS(), name, &res[0])); |
Can we go back to the earlier design of making res
an array?
Please also add a comment explaining why we res
needs to be trated specially (we need pass a pointer to _PyEval_LoadGlobalStackRef
).
The code generator already understands arrays, and it avoids confusion as to where it should insert stack spills for fully deferred reference counting.
You'll able to remove def stack_entry
below.
Sorry for the back and forth on this.
@@ -211,6 +212,37 @@ def py_stack_ref_from_py_object_new( | |||
# unused portions of the stack to NULL. | |||
stack.flush_single_var(self.out, target, uop.stack.outputs) | |||
|
|||
def stack_entry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that the stack calculations are correct here, and I don't think it will be able to handle code that writes to multiple results, like UNPACK...
.
If we go back to using arrays, we can remove this.
When you're done making the requested changes, leave the comment: |
|
This reverts commit 8810e28.
This reverts commit 8810e28.
This implements PEP 703 deferred refcounting for LOAD_GLOBAL in an agnostic way using the cases generator. So it won't block full deferring of references in the future.
For now, we need to write directly to a stackref entry to keep the global alive. Changing the dictionary lookup to stackref variants is also a prerequisite for LOAD_ATTR and friends.